Foreword


Once there was a thing called Twitter, where people exchanged short messages called 
‘tweets’. While it had its flaws, I came to like it and eventually decided to teach a short 
course on entropy in the form of tweets. This little book is a slightly expanded version 
of that course. 

It's easy to wax poetic about entropy, but what is it? I claim it's the amount of 
information we don't know about a situation, which in principle we could learn. But 
how can we make this idea precise and quantitative? To focus the discussion I decided 
to tackle a specific puzzle: why does hydrogen gas at room temperature and pressure 
have an entropy corresponding to about 23 unknown bits of information per molecule? 
This gave me an excuse to explain these subjects: 


• information 

• Shannon entropy and Gibbs entropy 

• the principle of maximum entropy

• the Boltzmann distribution 

• temperature and coolness 

• the relation between entropy, expected energy and temperature 
• the equipartition theorem 

• the partition function 

• the relation between entropy, free energy and expected energy 
• the entropy of a classical harmonic oscillator 

• the entropy of a classical particle in a box 

• the entropy of a classical ideal gas. 


I have largely avoided the second law of thermodynamics, which says that entropy 
always increases. While fascinating, this is so problematic that a good explanation 
would require another book! I have also avoided the role of entropy in biology, black 
hole physics, etc. Thus, the aspects of entropy most beloved by physics popularizers 
will not be found here. I also never say that entropy is ‘disorder’. 

I have tried to say as little as possible about quantum mechanics, to keep the physics 
prerequisites low. However, Planck's constant shows up in the formulas for the entropy 
of the three classical systems mentioned above. The reason for this is fascinating:
Planck's constant provides a unit of volume in position-momentum space, which is 
necessary to define the entropy of these systems. Thus, we need a tiny bit of quantum 
mechanics to get a good approximate formula for the entropy of hydrogen, even if we 
are trying our best to treat this gas classically. 

Since I am a mathematical physicist, this book is full of math. I spend more time 
trying to make concepts precise and looking into strange counterexamples than an 
actual ‘working’ physicist would. If at any point you feel I am sinking into too many 
technicalities, don't be shy about jumping to the next tweet. The really important stuff 
is in the boxes. It may help to reach the end before going back and learning all the 
details. It's up to you. 


THE ENTROPY OF THE OBSERVABLE UNIVERSE


In 2010, Chas A. Egan and Charles H. Lineweaver estimated the
biggest contributors to the entropy of the observable universe.
Measuring entropy in bits, these are:


• stars: $10^{81}$ bits.
• interstellar and intergalactic gas and dust: $10^{82}$ bits.
• gravitons: $10^{88}$ bits.
• neutrinos: $10^{90}$ bits.
• photons: $10^{90}$ bits.
• stellar black holes: $10^{98}$ bits.
• supermassive black holes: $10^{105}$ bits.


So, almost all the entropy is in supermassive black holes!


In 2010, Chas A. Egan and Charles H. Lineweaver estimated the entropy of the 
observable universe. Entropy corresponds to unknown information, so there's a heck 
of a lot we don't know! For stars, most of this unknown information concerns the 
details of every single electron and nucleus zipping around in the hot plasma. There's 
more entropy in interstellar and intergalactic gas and dust. Most of the gas here is 
hydrogen: some in molecular form H₂, some individual atoms, and some ionized. For all
this stuff, the unknown information again mostly concerns the details, like the position 
and momentum, of each of these molecules, atoms and ions. 

There's a lot more we don't know about the precise details of other particles whizzing 
through the universe, like gravitons, neutrinos and photons. But there's even more 
entropy in black holes! One reason Stephen Hawking is famous is that he figured out 
how to compute the entropy of black holes. To do that you need a combination of 
statistical mechanics, general relativity and quantum physics. Statistical mechanics is 
the study of physical systems where there's unknown information, which you study 
using probability theory. I'll explain some of that in these tweets. General relativity
is Einstein's theory of gravity, and while I've explained that elsewhere, I don't want to
get into it here—so I will say nothing about the entropy of black holes. 

Quantum physics was also necessary for Hawking's calculation, as witnessed by 
the fact that his answer involves Planck's constant, which sets the scale of quantum 
uncertainty in our universe. I will try to steer clear of quantum mechanics in these 
tweets, but in the end we'll need a tiny bit of it. There's a funny sense in which statistical 
mechanics is somewhat incomplete without quantum mechanics. You'll eventually see
what I mean. 


THE ENTROPY OF HYDROGEN 


At standard temperature and pressure, hydrogen gas has an entropy
of

130.68 joule/kelvin per mole

But a joule/kelvin of entropy is about

$1.0449 \cdot 10^{23}$ bits of unknown information

and a mole of any chemical is about

$6.0221 \cdot 10^{23}$ molecules

So the unknown information about the precise microscopic state of
hydrogen is

$$ 130.68 \cdot \frac{1.0449 \cdot 10^{23}}{6.0221 \cdot 10^{23}} \approx 23 \text{ bits per molecule!} $$


Egan and Lineweaver estimated the entropy of all the interstellar and intergalactic 
gas and dust in the observable universe. Entropy corresponds to information we don’t 
know. Their estimate implies that there are $10^{82}$ bits of information we don’t know
about all this gas and dust. 

Most of this stuff is hydrogen. Hydrogen is very simple stuff. So it would be 
good to understand the entropy of hydrogen. You can measure changes in entropy by 
doing experiments. If you assume hydrogen has no entropy at absolute zero, you can 
do experiments to figure out the entropy of hydrogen under other conditions. From 
this you can calculate that each molecule in a container of hydrogen gas at standard 
temperature and pressure has about 23 bits of information that we don’t know. 
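The arithmetic behind the 23 bits is easy to check. Here is a minimal sketch; the function name `bits_per_molecule` is made up for illustration, and I am assuming the exact SI values of Boltzmann's constant and Avogadro's number, together with the fact (justified only later) that one bit corresponds to k ln 2 joules per kelvin.

```python
import math

BOLTZMANN_K = 1.380649e-23   # joules per kelvin (exact SI value)
AVOGADRO = 6.02214076e23     # molecules per mole (exact SI value)

def bits_per_molecule(molar_entropy_j_per_k):
    """Convert a molar entropy in J/(K mol) into bits of unknown information per molecule."""
    # Assumption: one bit of information corresponds to k ln 2 joules per kelvin.
    bits_per_joule_per_kelvin = 1.0 / (BOLTZMANN_K * math.log(2))
    return molar_entropy_j_per_k * bits_per_joule_per_kelvin / AVOGADRO

hydrogen_bits = bits_per_molecule(130.68)
print(round(hydrogen_bits, 1))
```

The result lands a little below 23, which is why the text says "about 23 bits".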

You can see a sketch of the calculation above. But everything about it is far from 
obvious! What does ‘missing information’ really mean here? Joules are a unit of energy; 
kelvin is a unit of temperature. So why is entropy measured in joules per kelvin? Why 
does one joule per kelvin correspond to $1.0449 \cdot 10^{23}$ bits of missing information? How
can we do experiments to measure changes in entropy? And why is missing information 
the same as—or more precisely proportional to—entropy? 

The good news: all these questions have answers! You can learn them here. How- 
ever, you will have to persist. Since I’m starting from scratch it won’t be quick. It 
takes some math—but luckily, nothing much more than calculus of several variables. 
When you can calculate the entropy of hydrogen from first principles, and understand 
what it means, that will count as true success. 

See how it goes! Partial success is okay too. 


WHERE ARE WE GOING? 


The mystery: why does each molecule of hydrogen have ~ 23 bits 
of entropy at standard temperature and pressure? 


The goal: derive and understand the formula for the entropy of a
classical ideal monatomic gas:

$$ S = kN \left( \frac{3}{2} \ln kT + \ln \frac{V}{N} + \gamma \right) $$

including the mysterious constant $\gamma$.


The subgoal: compute the entropy of a single classical particle in a 
1-dimensional box. 


The sub-subgoal: explain entropy from the ground up, and compute 
the entropy of a classical harmonic oscillator. 


To understand something deeply, it can be good to set yourself a concrete goal. 
To avoid getting lost in the theory of entropy, let's try to understand the entropy of 
hydrogen gas. This is a ‘diatomic’ gas since a hydrogen molecule has two atoms. At 
standard temperature and pressure it's close to ‘ideal’, meaning the molecules don't 
interact much. It's also close to ‘classical’, meaning we don't need to know quantum 
mechanics to do this calculation. Also, when the hydrogen is not extremely hot, its 
molecules don't vibrate much—but they do tumble around. 

Given all this, we can derive a formula for the entropy S of some hydrogen gas 
as a function of its temperature T, the number N of molecules, the volume V, and a
physical constant k called ‘Boltzmann’s constant’. This formula also involves a rather
surprising constant which I'm calling $\gamma$. We'll figure that out too. It's so weird I don't
want to give it away! 

As a warmup, we will derive the formula for the entropy of an ideal ‘monatomic’
gas: a gas made of individual atoms, like helium or neon or argon. Sackur and Tetrode
worked this out in 1912. Their result, called the Sackur–Tetrode equation, is similar to
the one for a diatomic gas.

But before doing a monatomic gas, we'll figure out the entropy of a single atom of 
gas in a box. That turns out to be a good start, since in an ideal monatomic gas the 
atoms don't interact, and the entropy of N atoms—as we'll see—is just N times the 
entropy of a single atom. 

But before we can do any of this, we need to understand what entropy is, and how 
to compute it. It will take quite a bit of time to compute the entropy of a classical 
harmonic oscillator! But from then on, the rest is surprisingly quick. 


FIVE KINDS OF ENTROPY 


Entropy in thermodynamics: the change in entropy as we change a 
system's internal energy by an infinitesimal amount dE while keeping 
it in thermal equilibrium is dS = dE/T, where T is the temperature. 


Entropy in classical statistical mechanics:
$S = -k \int_X p(x) \ln(p(x)) \, d\mu(x)$ where p is a probability distribution
on the measure space $(X, \mu)$ of states and k is Boltzmann’s constant.


Entropy in quantum statistical mechanics: $S = -k \, \mathrm{tr}(\rho \ln \rho)$ where
$\rho$ is a density matrix.


Entropy in information theory: $H = -\sum_{i \in X} p_i \log p_i$ where p is a
probability distribution on the set X.


Algorithmic entropy: the entropy of a string of symbols is the length 
of the shortest computer program that prints it out. 


Before I actually start explaining entropy, a warning: it can be hard at first to learn 
about entropy because there are many kinds—and people often don't say which kind 
they're talking about. Here are 5 kinds. Luckily, they are closely related! 

In thermodynamics we primarily have a formula for the change in entropy: if you 
change the internal energy of a system by an infinitesimal amount dE while keeping it 
in thermal equilibrium, the infinitesimal change in entropy is dS = dE/T where T is
the temperature. 

Later, in classical statistical mechanics, Gibbs explained entropy in terms of a prob- 
ability distribution p on the space of states of a classical system. In this framework, 
entropy is the integral of —p ln p times a constant k called Boltzmann's constant. 

Later von Neumann generalized Gibbs' formula for entropy from classical to quantum 
statistical mechanics! He replaced the probability distribution p by a so-called density
matrix $\rho$, and the integral by a trace.

Later Shannon invented information theory, and a formula for the entropy of a 
probability distribution on a set (often a finite set). This is often called ‘Shannon
entropy’. It's just a special case of Gibbs' formula for entropy in classical statistical
mechanics, but without Boltzmann's constant.

Later still, Kolmogorov invented a formula for the entropy of a specific string of 
symbols. It's just the length of the shortest program, written in bits, that prints out 
this string. It depends on the computer language, but not too much. 

There's a network of results connecting all these 5 concepts of entropy. I will first 
explain Shannon entropy, then entropy in classical statistical mechanics, and then en- 
tropy in thermodynamics. While this is the reverse of the historical order, it's the 
easiest way to go. 

I will not explain entropy in quantum statistical mechanics: for that I would feel 
compelled to teach you quantum mechanics first. Nor will I explain algorithmic entropy. 


FROM PROBABILITY TO INFORMATION 


How much information do you get when you learn an event of
probability p has happened? It’s

$$ -\log p $$

where we can use any base for the logarithm, usually e or 2.


Example: Suppose I flip 3 coins that you know are fair. I tell you
the outcome: “heads, tails, heads”. That's an event of probability
$1/2^3$, so the information you get is

$$ -\log\left(\frac{1}{2^3}\right) = 3 \log 2 $$

or “3 bits” for short, since log 2 of information is called a bit.


Here is the simplest link between probability and information: when you learn that 
an event of probability p has happened, how much information do you get? We say 
it's $-\log p$. We take a logarithm so that when you multiply probabilities, information
adds. The minus sign makes information come out positive. 

Beware: when I write ‘log’ I don't necessarily mean the logarithm base 10. I 
mean that you can use whatever base for the logarithm you want; this choice is like a 
choice of units. Whatever base b you decide to use, I'll call $\log_b 2$ a ‘bit’. For example,
if I flip a single coin that you know is fair, and you see that it comes up heads, you
learn of an event that’s of probability 1/2, so the amount of information you learn is

$$ -\log_b \frac{1}{2} = \log_b 2. $$


That’s one bit! Of course if you use base b = 2 then this logarithm actually equals 1, 
which is nice. 
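To make the formula concrete, here is a small sketch in Python. The function name `information` and its `base` parameter are my own choices for illustration, not anything standard.

```python
import math

def information(p, base=2):
    """Information gained on learning that an event of probability p happened."""
    if not 0 < p <= 1:
        raise ValueError("p must lie in (0, 1]")
    return -math.log(p) / math.log(base)

# Three fair coins: any particular outcome has probability 1/2^3.
three_coins_bits = information(1 / 2**3)           # measured in bits (base 2)
three_coins_nats = information(1 / 2**3, math.e)   # the same event, measured in nats

# Independent events multiply probabilities, so their information adds.
p, q = 0.5, 0.25
assert math.isclose(information(p * q), information(p) + information(q))

print(three_coins_bits, three_coins_nats)
```

Changing `base` only rescales the answer, which is the sense in which the base is a choice of units.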
To understand the concept of information it helps to do some puzzles. 


Puzzle 1. First I flip 2 fair coins and tell you the outcome. Then I flip 3 more and tell 
you the outcome. How much information did you get? 


Puzzle 2. I roll a fair 6-sided die and tell you the outcome. Approximately how much 
information do you get, using logarithms base 2? 


Puzzle 3. When you flip 7 fair coins and tell me the outcome, how much information 
do I get? 


Puzzle 4. Every day I eat either a cheese sandwich, a salad, or some fried rice for lunch— 
each with equal probability. I tell you what I had for lunch today. Approximately how 
many bits of information do you get? 


Puzzle 5. I have a trick coin that always lands heads up. You know this. I flip it 5 
times and tell you the outcome. How much information do you receive? 


Puzzle 6. I have a trick coin that always lands heads up. You believe it's a fair coin. I 
flip it 5 times and tell you the outcome. How much information do you receive? 


Puzzle 7. I have a trick coin that always lands with the same face up. You know this, 
but you don't know which face always comes up. I flip it 5 times and tell you the 
outcome. How much information do you receive? 


These puzzles raise some questions about the nature of probability, like: is it sub- 
jective or objective? People like to argue about those questions. But once we get a 
probability p, we can convert it to information by computing $-\log p$.


UNITS OF INFORMATION 


An event of probability 1/2 carries one bit of information. 
An event of probability 1/e carries one nat of information. 
An event of probability 1/3 carries one trit of information. 
An event of probability 1/4 carries one crumb of information. 
An event of probability 1/10 carries one hartley of information. 
An event of probability 1/16 carries one nibble of information. 


An event of probability 1/256 carries one byte of information. 


An event of probability $1/2^{8192}$ carries one kilobyte of information.


There are many units of information. Using information = $-\log p$ we can relate
these to probabilities. For example if you see a number in base 10, and each digit shows 
up with probability 1/10, the amount of information you get from each digit is one 
‘hartley’. 

How many bits are in a hartley? Remember: no matter what base you use, I call 
log 10 a hartley and log 2 a bit. There are $\log 10 / \log 2$ bits in a hartley. This number
has the same value no matter what base you use for your logarithms! If you use base
2, it’s

$$ \log_2 10 / \log_2 2 = \log_2 10 \approx 3.32. $$

So a hartley is about 3.32 bits.

If you flip 8 fair coins and tell me the outcome, I’ve learned of an event
that has probability $1/2^8 = 1/256$. We say I’ve received a ‘byte’ of information. This
equals 8 bits of information. Similarly, if you flip 1024 x 8 fair coins and tell me the 
outcome, I receive a kilobyte of information. 

Or at least that's the old definition. Now many people define a kilobyte to be 1000 
bytes rather than 1024 bytes, in keeping with the usual meaning of the prefix. If you 
want 1024 bytes you're supposed to ask for a ‘kibibyte’. When we get to a terabyte,
the new definition based on powers of 10 is about 10% less than the old one based on
powers of 2: $10^{12}$ bytes rather than $2^{40} \approx 1.0995 \times 10^{12}$. If you want the old larger
amount of information you should ask for a ‘tebibyte’. 
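If you like, you can let a computer do these unit conversions. This is only a sketch: the table and the function names `unit_in_bits` and `convert` are invented for illustration. I left out the kilobyte because the probability $1/2^{8192}$ is too small to store as an ordinary floating-point number.

```python
import math

# Probability of the event that carries exactly one unit of each kind.
UNIT_EVENT_PROBABILITY = {
    "bit": 1 / 2,
    "nat": 1 / math.e,
    "trit": 1 / 3,
    "crumb": 1 / 4,
    "hartley": 1 / 10,
    "nibble": 1 / 16,
    "byte": 1 / 256,
}

def unit_in_bits(unit):
    """Size of one unit of information, measured in bits."""
    return -math.log2(UNIT_EVENT_PROBABILITY[unit])

def convert(amount, from_unit, to_unit):
    """Convert an amount of information between two of the units above."""
    return amount * unit_in_bits(from_unit) / unit_in_bits(to_unit)

for name in UNIT_EVENT_PROBABILITY:
    print(f"1 {name} = {unit_in_bits(name):.4f} bits")
```

For example, `convert(1, "byte", "nibble")` asks how many nibbles are in a byte.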

Wikipedia has an article that lists many strange units of information. Did you know 
that 2 bits is a ‘crumb’? Did you even need to know? No, but now you do. 

Feel free to dispose of this unnecessary information! All this is just for fun—but I 
want you to get used to the formula 


$$ \text{information} = -\log p $$


THE INFORMATION IN A LICENSE PLATE NUMBER


If there are N different possible license plate numbers, all equally 
likely, how many bits of information do you learn when you see 
one? 


If you think N alternatives are equally likely, when you see which one actually 
occurs, you gain an amount of information equal to $\log_b N$. Here the choice of base b
is up to you: it’s a choice of units. But what is this in bits? No matter what base you
use,

$$ \log_b N = \log_2 N \times \log_b 2. $$

Since we call $\log_b 2$ a ‘bit’, this means you’ve learned $\log_2 N$ bits of information.

Let’s try it out! 


Puzzle 8. Suppose a license plate has 7 numbers and/or letters on it. If there are 
10+26 choices of number and/or letter, there are $36^7$ possible license plate numbers. If
all license plates are equally likely, what’s the information in a license plate number in 
bits—approximately? 




But wait! Suppose I tell you that all license plate numbers have a number, then 3 
letters, then 3 numbers! You have just learned a lot of information. So the remaining 
information content of each license plate is presumably less. Let's work it out. 


Puzzle 9. How much information is there in a license plate number if they all have
a number, then 3 letters, then 3 numbers? (Assume they're all equally probable and 
there are 10 choices of each number and 26 choices of each letter.) 


The moral: when you learn more about the possible choices, the information it takes 
to describe a choice drops. 
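Here is one way to check this moral numerically. It is a sketch under the same assumptions as the puzzles (all plates equally likely, 10 digits and 26 letters); the function name `bits_for_equally_likely` is made up for illustration. Running it gives away the answers to Puzzles 8 and 9, so try them first!

```python
import math

def bits_for_equally_likely(n_alternatives):
    """Information, in bits, gained on seeing one of n equally likely alternatives."""
    return math.log2(n_alternatives)

DIGITS, LETTERS = 10, 26

# Any of 36 symbols in each of 7 positions.
unrestricted = (DIGITS + LETTERS) ** 7

# One digit, then three letters, then three digits.
restricted = DIGITS * LETTERS**3 * DIGITS**3

bits_unrestricted = bits_for_equally_likely(unrestricted)
bits_restricted = bits_for_equally_likely(restricted)

# Learning the format removes this many bits of uncertainty.
bits_saved = bits_unrestricted - bits_restricted
print(bits_unrestricted, bits_restricted, bits_saved)
```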


THE INFORMATION IN A LICENSE PLATE 


How much unknown information do the atoms in a license plate 
contain? 


Aluminum has an entropy of about 28 joules/kelvin per mole at 
standard temperature and pressure. A mole of aluminum weighs 
about 27 grams. A typical license plate might weigh 150 grams, and
thus have

$$ 150 \text{ g} \times \frac{28 \text{ J/(K} \cdot \text{mole)}}{27 \text{ g/mole}} \approx 160 \text{ J/K} $$

of entropy. But a joule/kelvin of entropy is about $10^{23}$ bits of
unknown information. Thus, the atoms in such a license plate contain
about

$$ 160 \times 10^{23} \text{ bits} \approx 1.6 \cdot 10^{25} \text{ bits} $$

of unknown information.


Last time we talked about the information in a license plate number. A license plate
number made of 7 numbers and/or letters contains

$$ \log_2(36^7) \approx 36.189 $$


bits of information if all combinations are equally likely. How does this compare to the 
information in the actual metal of the license plate? 

These days most license plates are made of aluminum, and they weigh roughly 
between 100 and 200 grams. Let's say 150 grams. If we work out the entropy of this 
much aluminum, and express it in bits of unknown information, we get an enormous 
number: roughly 


16,000,000,000,000,000,000,000,000 bits!


Here is the point. While the information on the license plate and the information 
in the license plate can be studied using similar mathematics, the latter dwarfs the 
former. Thus, when we are doing chemistry and want to know, for example, how much 
the entropy of the license plate increases when we dissolve it in hydrochloric acid, the 
information in the writing on the license plate is irrelevant for all practical purposes. 
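To see just how lopsided the comparison is, here is a rough sketch. The numbers for aluminum and the plate's mass are the ones quoted above; the function name `entropy_in_bits` is invented for illustration, and the conversion assumes one bit corresponds to k ln 2 joules per kelvin.

```python
import math

BOLTZMANN_K = 1.380649e-23  # joules per kelvin (exact SI value)

def entropy_in_bits(entropy_j_per_k):
    """Express a thermodynamic entropy in J/K as bits of unknown information."""
    return entropy_j_per_k / (BOLTZMANN_K * math.log(2))

molar_entropy = 28.0   # J/(K mol) for aluminum, as quoted above
molar_mass = 27.0      # grams per mole, as quoted above
plate_mass = 150.0     # grams, a rough guess for a license plate

plate_entropy = plate_mass * molar_entropy / molar_mass   # in J/K
bits_in_plate = entropy_in_bits(plate_entropy)
bits_on_plate = 7 * math.log2(36)                         # the 7-symbol plate number

print(f"{plate_entropy:.0f} J/K, {bits_in_plate:.2e} bits in the metal")
print(f"ratio of 'in' to 'on': {bits_in_plate / bits_on_plate:.1e}")
```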

Some people get fooled by this, in my opinion, and claim that "information" and 
"entropy" are fundamentally unrelated. I disagree. 


JUSTIFYING THE FORMULA FOR INFORMATION


Why do we say the information of an event of probability p is 


$$ I(p) = -\log_b p $$


for some base b > 1? Here’s why:


Theorem. Suppose $I \colon (0,1] \to \mathbb{R}$ is a function that is:


1. Decreasing: p < q implies I(p) > I(q). This says less probable
events have more information.


2. Additive: I(pq) = I(p) + I(q). This says the information of the
combination of two independent events is the sum of their separate
informations.


Then for some b > 1 we have $I(p) = -\log_b p$.


The information of an event of probability p is $-\log p$, where you get to choose the
base of the logarithm. But why? This is the only option if we want less probable events 
to have more information, and information to add for independent events. 

Proving this will take some math—but don't worry, you won't need to know this 
stuff for the rest of this ‘course’. 

Since we're trying to prove I(p) is a logarithm function, let's write

$$ I(p) = f(\ln(p)) $$

and prove f has to be linear:

$$ f(x) = cx. $$

As we'll see, this gets the job done.

Writing I(p) = f(x) where x = ln p, we can check that Condition 1 above is
equivalent to

$$ x < y \text{ implies } f(x) > f(y) \text{ for all } x, y \le 0. $$

Similarly, we can check that Condition 2 is equivalent to

$$ f(x + y) = f(x) + f(y) \text{ for all } x, y \le 0. $$

Now what functions f have

$$ f(x + y) = f(x) + f(y) $$

for all $x, y \le 0$?

If we define f(−x) = −f(x), f will become a function from the whole real line to
the real numbers, and it will still obey f(x + y) = f(x) + f(y). So what functions obey
this equation? The obvious solutions are

$$ f(x) = cx $$

for any real constant c. But are there any other solutions?

Yes, if you use the axiom of choice! Treat the reals as a vector space over the 
rationals. Using the axiom of choice, pick a basis. To get $f \colon \mathbb{R} \to \mathbb{R}$ that’s linear over
the rational numbers, just let f send each basis element to whatever real number you
want and extend it to a linear function defined on all of $\mathbb{R}$. This gives a function f that
obeys f(x + y) = f(x) + f(y).


However, no solutions of f(x + y) = f(x) + f(y) meet our other condition

$$ x < y \text{ implies } f(x) > f(y) \text{ for all } x, y \le 0 $$


except for the familiar ones f(x) = cx. For a proof see Wikipedia: they show all 
solutions except the familiar ones are so discontinuous their graphs are dense in the 
plane! 


• Wikipedia, Cauchy's functional equation. 


So, our conditions imply f(x) = cx for some c, and since f is decreasing we need c < 0. 
So our formula I(p) = f(ln p) says

$$ I(p) = c \ln p $$

but this equals $-\log_b p$ if we take $b = \exp(-1/c)$. And this number b can be any number
greater than 1. QED.
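If you distrust the last step, a quick numerical check may help. This sketch only tests a handful of values, so it proves nothing; the two function names are made up for illustration.

```python
import math

def info_from_slope(p, c):
    """I(p) = c ln p, the form the proof arrives at (c must be negative)."""
    return c * math.log(p)

def info_from_base(p, b):
    """I(p) = -log_b p, the form stated in the theorem (b must exceed 1)."""
    return -math.log(p) / math.log(b)

c = -1.0 / math.log(2)          # this slope should correspond to base b = 2
b = math.exp(-1.0 / c)

# The two forms agree at every probability we try.
for p in (1.0, 0.5, 0.1, 1e-6):
    assert math.isclose(info_from_slope(p, c), info_from_base(p, b), abs_tol=1e-12)

# Condition 2: additivity for independent events.
p, q = 0.3, 0.2
assert math.isclose(info_from_base(p * q, b),
                    info_from_base(p, b) + info_from_base(q, b))
print(b)
```

Any negative c works the same way: a more negative c gives a base b closer to 1.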


Thus, if we want a more general concept of the information associated to a probability,
we need to drop Condition 1 or 2. For example, we could replace additivity by
some other rule. People have tried this! Indeed, there is a world of generalized entropy 
concepts including Tsallis entropies, Rényi entropies and others. 


WHAT IS PROBABILITY?


The theory of probabilities is at bottom nothing but common sense 
reduced to calculus; it enables us to appreciate with exactness 
that which accurate minds feel with a sort of instinct for which ofttimes
they are unable to account. — Pierre-Simon Laplace


In no other branch of mathematics is it so easy for experts to 
blunder as in probability theory. — Martin Gardner 


Since I've defined information in terms of probability, you may naturally wonder
“what is probability?” I won't seriously try to answer this. This question has stirred
up many debates over the centuries, and even today there's not a fully accepted answer. 
It deserves a whole book—and this is not that book. Luckily, we don’t really need to 
know exactly what probability is to do calculations with it: we mainly need to set 
up some rules for working with it. This may seem like a cop-out. But it's a strange 
and wonderful feature of science that we can achieve great reliability in our results by 
sidestepping certain difficult questions, like someone who can make their way safely 
through a jungle by avoiding the quicksand and snakes. 

One approach to probability goes like this. Suppose you repeat some experiment 
N times, doing your best to make the conditions the same each time. Suppose that M
of these times some event E occurs. You may then say that the probability of event E 
happening under these conditions is M/N. This approach is called ‘finite frequentism’. 
Unfortunately, this approach can lead you to say a coin has probability 1 of landing 
heads up if it does so the first time, or first 3 times, you flip it. 

Another approach goes like this. You may say that some event E has probability p 
under some conditions if when you set up these conditions N times, and the event E 
happens M times, the fraction M/N approaches p in the limit $N \to \infty$. This approach
is called ‘hypothetical frequentism’, because in real experiments you can't take the limit
$N \to \infty$. But you can hope that when N becomes large enough, the fraction M/N
usually becomes close to the limiting probability p—whatever that means. 
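You can watch both kinds of frequentism at work in a simulation. This is only a sketch: a pseudo-random number generator stands in for the coin, which quietly assumes the coin already has a probability of 1/2, and the function name `relative_frequency` is invented for illustration.

```python
import random

def relative_frequency(n_trials, p_heads, rng):
    """Fraction M/N of simulated flips that come up heads."""
    heads = sum(1 for _ in range(n_trials) if rng.random() < p_heads)
    return heads / n_trials

rng = random.Random(0)   # fixed seed so the run is repeatable
frequencies = {n: relative_frequency(n, 0.5, rng) for n in (1, 3, 100, 100_000)}

for n, freq in frequencies.items():
    print(f"N = {n:>6}: M/N = {freq:.4f}")
```

With N = 1 the fraction M/N can only be 0 or 1, which is the trouble with finite frequentism. With large N it usually settles near 1/2, but only usually.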

Another approach, called ‘Bayesianism’, treats a probability of an event E under
some specified conditions as a measure of your degree of belief that E will happen 
under these conditions. But what is ‘degree of belief’? One answer involves bets. For 
example, perhaps to believe an event has probability 1/2 means you’re willing to take 
a bet where you win more when the event happens than you lose if it does not. 

Bayesians tend to focus on the rules for updating your probabilities as you learn new 
things, the most famous being ‘Bayes’ rule’. Even if agents start by assigning different
probabilities to an event, if they follow the same rules for changing these probabilities 
as they learn new things, under certain circumstances we can prove their probabilities 
will converge to the same value. 
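Here is a toy illustration of that convergence. It is a sketch under strong assumptions that are mine, not part of any general theorem: two hypothetical agents model a coin with Beta priors, both see the same flips, and both update by Bayes' rule, which for a Beta prior just adds the observed counts to the prior's parameters.

```python
from fractions import Fraction

def posterior_mean(prior_heads, prior_tails, heads, tails):
    """Mean of a Beta(prior_heads, prior_tails) prior after seeing the given flips."""
    return Fraction(prior_heads + heads, prior_heads + prior_tails + heads + tails)

# Two hypothetical agents: a skeptic and an optimist about "heads".
skeptic = (1, 9)     # prior mean 1/10
optimist = (9, 1)    # prior mean 9/10

gaps = []
for n_flips in (0, 10, 100, 1000):
    heads = 7 * n_flips // 10            # pretend 70% of the flips were heads
    tails = n_flips - heads
    a = posterior_mean(*skeptic, heads, tails)
    b = posterior_mean(*optimist, heads, tails)
    gaps.append(abs(a - b))
    print(n_flips, float(a), float(b))
```

The gap between the two agents shrinks as the shared evidence piles up, even though they started far apart.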

For a passionate and intelligent discussion of these issues, I recommend E. T. Jaynes' 
book Probability Theory: the Logic of Science. Later we'll meet his ‘principle of max- 
imum entropy’, another important approach to working with probabilities. 


PROBABILITY MEASURES


A measure on a set X is a function that assigns to certain so-called
measurable subsets $S \subseteq X$ a number $m(S) \in [0, \infty]$, obeying
these rules:


• $\emptyset$ and X are measurable and

$$ m(\emptyset) = 0 $$


• If $S, T \subseteq X$ are measurable and $S \subseteq T$, then $T - S$ is
measurable and

$$ m(T) = m(S) + m(T - S) $$


• If a countable collection of disjoint subsets $S_i \subseteq X$ are
measurable, then their union is measurable and

$$ m\left( \bigcup_{i=1}^{\infty} S_i \right) = \sum_{i=1}^{\infty} m(S_i) $$


We say m is a probability measure if m(X) = 1.


It is easier to do calculations with probabilities than say exactly what they mean! I 
will take a rough-and-ready approach to working with them, but first let's take a peek 
at how mathematicians do it. If you don't care, it’s safe to move right on to the next 
tweet. 

We start with any set X. We call elements of X ‘outcomes’ and subsets of X ‘events’.
We can sometimes get into trouble trying to assign a probability to every subset of
X. So, we'll only try to assign probabilities to events in some collection M with these
properties:


• $\emptyset \in M$ and $X \in M$.


• If $S, T \in M$ and $S \subseteq T$ then the set of elements of T that are not in S, called
$T - S$, is in M.


• If $S_i \in M$ for i = 1, 2, … then the union $\bigcup_{i=1}^{\infty} S_i$ is in M.


We call elements of M measurable subsets of X. A measure is then a function
$m \colon M \to [0, \infty]$ obeying these rules:


• $m(\emptyset) = 0$.
• If $S, T \in M$ and $S \subseteq T$ then $m(T) = m(S) + m(T - S)$.
• If the sets $S_i \in M$ are disjoint then $m\left(\bigcup_{i=1}^{\infty} S_i\right) = \sum_{i=1}^{\infty} m(S_i)$.

If m also obeys m(X) = 1 then we say m is a probability measure, and for any $S \in M$
we say m(S) is the probability of the event S. But we will also be interested in other
measures, like the measure on the real line called ‘Lebesgue measure’. This is closely
connected to the symbol ‘dx’ that shows up in integrals, because for any measurable
set $S \subseteq \mathbb{R}$, its Lebesgue measure is

$$ \int_{\mathbb{R}} \chi_S(x) \, dx $$

where $\chi_S(x)$ is 1 for $x \in S$ and 0 for $x \notin S$. Indeed, people often get sloppy and say
dx ‘is’ Lebesgue measure, and I may do that too. By the way, Lebesgue measure is one 
where we cannot take M to be the collection of all subsets of R. 

There is an extensive theory of measures. We will not need it here, but if you’re 
interested, you can try a book like Halsey Royden’s Real Analysis, where I learned the 
basics myself, or Terry Tao’s An Introduction to Measure Theory, which has a legal 
free version online. 

Here are some puzzles about measures. 


Puzzle 10. Let X be any set and define M to be the collection of all subsets of X.
Show that there is a measure $m \colon M \to [0, \infty]$ called counting measure such that for
any $S \subseteq X$, m(S) is the number of elements of S, or $\infty$ if S is infinite.


Puzzle 11. Let X be any set and define M as before. Suppose p is a probability
distribution on X, meaning a function $p \colon X \to [0, \infty)$ with $\sum_{i \in X} p(i) = 1$. Show that
there is a probability measure $m \colon M \to [0, \infty]$ such that for any $S \subseteq X$,

$$ m(S) = \sum_{i \in S} p(i). $$

In this situation we usually write p(i) as $p_i$ and call it the probability of the outcome
$i \in X$. For any $S \in M$ we call m(S) the probability of the event S.
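For a finite set X all of this can be made very concrete. Here is a sketch with invented function names that treats only finite sets, so it sidesteps everything subtle about measure theory. It uses exact fractions so that the additivity rule can be checked with `==`.

```python
from fractions import Fraction

def counting_measure(subset):
    """Counting measure: the number of elements of a finite set."""
    return len(subset)

def measure_from_distribution(p):
    """Turn a probability distribution p (a dict outcome -> probability) into a measure."""
    def m(subset):
        return sum((p[i] for i in subset), Fraction(0))
    return m

# A fair six-sided die as a probability distribution on X = {1, ..., 6}.
X = frozenset(range(1, 7))
p = {i: Fraction(1, 6) for i in X}
m = measure_from_distribution(p)

S = frozenset({2, 4, 6})         # the event "even"
T = frozenset({2, 3, 4, 5, 6})   # the event "at least 2"; note S is a subset of T

assert m(frozenset()) == 0 and m(X) == 1       # m is a probability measure
assert m(T) == m(S) + m(T - S)                 # the rule for T - S
assert counting_measure(T) == counting_measure(S) + counting_measure(T - S)
```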

In the next puzzles X is any set, M obeys the three rules for a collection of
measurable subsets of X, and $m \colon M \to [0, \infty]$ is a measure.


Puzzle 12. Show that if $S, T \in M$ then the union $S \cup T$ is in M.

Puzzle 13. Show that if $S, T \in M$ then the intersection $S \cap T$ is in M.

Puzzle 14. Show that if $S_i \in M$ for i = 1, 2, … then the intersection $\bigcap_{i=1}^{\infty} S_i$ is in M.

Puzzle 15. Show that if $S, T \in M$ and $S \subseteq T$ then $m(S) \le m(T)$.

Puzzle 16. Show that if $S_i \in M$ for i = 1, 2, … then

$$ m\left( \bigcup_{i=1}^{\infty} S_i \right) \le \sum_{i=1}^{\infty} m(S_i). $$

Puzzle 17. Show that if m is a probability measure and $S \in M$ then $0 \le m(S) \le 1$.


One of the main uses of a measure m on a space X is that it lets us integrate certain 
functions $f \colon X \to \mathbb{R}$. Alas, not all functions! It’s only reasonable to try to integrate
measurable functions $f \colon X \to \mathbb{R}$, which have the property that if $S \subseteq \mathbb{R}$ is measurable,
its inverse image $f^{-1}(S) \subseteq X$ is measurable. And even measurable functions can cause
trouble, because when we try to integrate them we might get $+\infty$, $-\infty$, or something
even worse. For example, what's

$$ \int_{-\infty}^{\infty} x^2 \sin x \, dx \, ? $$

There's no good answer. We say a function $f \colon X \to \mathbb{R}$ is integrable if it is measurable
and its integral over X, defined in a certain way I won't explain here, gives a well-defined 
real number. 


SHANNON ENTROPY: A FIRST TASTE


When you learn an event of probability p has happened, the amount 
of information you get is $-\log p$.


Question. Suppose you know a coin lands heads up $\frac{2}{3}$ of the time and
tails up $\frac{1}{3}$ of the time. What is the average or ‘expected’ amount of
information you get when you learn which side landed up?


Answer. $\frac{2}{3}$ of the time you get $-\log \frac{2}{3}$ of information, and $\frac{1}{3}$ of the
time you get $-\log \frac{1}{3}$. So, the expected amount of information you
get is

$$ -\frac{2}{3} \log \frac{2}{3} - \frac{1}{3} \log \frac{1}{3}. $$


You can do the same thing whenever you have any number of prob- 
abilities that add to 1. The expected information is called the Shan- 
non entropy. 


You flip a coin. You know the probability that it lands heads up. How much
information do you get, on average, when you discover which side lands up? It's not
hard to work this out. It's a simple example of ‘Shannon entropy’. Roughly speaking,
entropy is information that you don't know, that you could get if you did enough
experiments. Here the experiment is simply flipping the coin and looking at it.


Puzzle 18. Suppose you know a coin lands heads up $\frac{1}{2}$ of the time and tails up $\frac{1}{2}$ of the
time. What is the expected amount of information you get from a coin flip? If you use
base 2 for the logarithm, you get the expected information measured in bits. What do
you get?


Puzzle 19. Suppose you know a coin lands heads up $\frac{1}{3}$ of the time and tails up $\frac{2}{3}$ of the
time. What is the expected amount of information you get from a coin flip?


Puzzle 20. Suppose you know a coin lands heads up $\frac{1}{4}$ of the time and tails up $\frac{3}{4}$ of the
time. What is the expected amount of information you get from a coin flip, in bits?


If you solve these you'll see a pattern: the Shannon entropy is biggest when the 
coin is fair. As it becomes more and more likely for one side to land up than the other, 
the entropy drops. You're more sure about what will happen... so you learn less, on 
average, from seeing what happens! 

We've been doing examples where your experiment has just two possible outcomes: 
heads up or down. But you can do Shannon entropy for any number of outcomes. It 
measures how ignorant you are of what will happen. That is: how much you learn on 
average when it does! 
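For readers who like to compute, here is a minimal Python sketch of the coin calculation. The function name `coin_entropy_bits` and the sample biases are illustrative choices, not taken from the text.

```python
from math import log2

def coin_entropy_bits(p_heads):
    """Expected information, in bits, from one flip of a coin
    that lands heads up with probability p_heads."""
    total = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0:  # an outcome of probability 0 contributes nothing
            total -= p * log2(p)
    return total

# Illustrative biases (assumed for this sketch), from fair to very unfair.
for p_heads in (0.5, 0.6, 0.8, 0.9, 0.99):
    bits = coin_entropy_bits(p_heads)
    print(f"P(heads) = {p_heads:4.2f}   entropy = {bits:.4f} bits")
```

The printed entropies shrink as the coin becomes more lopsided, which is the pattern described above.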
SHANNON ENTROPY: A SECOND TASTE


According to the weather report there's a $\frac{1}{4}$ chance that it will rain
1 centimeter, a $\frac{1}{2}$ chance it will rain 2 centimeters, and a $\frac{1}{4}$ chance
it will rain 3 centimeters.


Question. What is the ‘expected’ amount of rainfall?


Answer. $\frac{1}{4} \cdot 1 + \frac{1}{2} \cdot 2 + \frac{1}{4} \cdot 3 = 2$ centimeters.


Question. What is the ‘expected’ amount of information you learn
when you find out how much it rains?


Answer. $-\frac{1}{4} \log \frac{1}{4} - \frac{1}{2} \log \frac{1}{2} - \frac{1}{4} \log \frac{1}{4} = \frac{3}{2} \log 2$, or in other words,
$\frac{3}{2}$ bits. This is the Shannon entropy of the weather report.


If the weather report tells you it'll rain different amounts with different probabilities, 
you can figure out the ‘expected’ amount of rain. You can also figure out the expected 
amount of information you'll learn when it rains. This is called the ‘Shannon entropy’. 

Shannon entropy is closely connected to information, but we can also think of it as 
a measure of ignorance. This may seem paradoxical. But it's not. Shannon entropy 
is the expected amount of information that you don't know when all you know is a 
probability distribution, which you will know when you see a specific outcome chosen 
according to this probability distribution. 

For example, consider a weather report that says it will rain 1 centimeter with 
probability 0, 2 centimeters with probability 1, and 3 centimeters with probability 0. 
The Shannon entropy of this weather report is 


$$-0 \log 0 - 1 \log 1 - 0 \log 0 = 0$$


since by convention we set $p \log p = 0$ when $p = 0$, this being the limit of $p \ln p$ as $p$
approaches 0 from above.
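
To see why this convention is reasonable, here is the limit worked out with L'Hôpital's rule, after writing $p \ln p$ as a quotient:

$$\lim_{p \to 0^+} p \ln p = \lim_{p \to 0^+} \frac{\ln p}{1/p} = \lim_{p \to 0^+} \frac{1/p}{-1/p^2} = \lim_{p \to 0^+} (-p) = 0.$$

Changing the base of the logarithm only multiplies $p \ln p$ by a constant, so the same limit holds for $p \log p$ in any base.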

What does it mean that this weather report has zero Shannon entropy? It means 
that when we see a specific outcome chosen according to this probability distribution, 
we learn nothing! The weather report says it will rain 2 centimeters with probability 
1. When this happens, we learn nothing that the weather report didn't already tell us. 

The Shannon entropy doesn't depend on the amounts of rain, or even whether the 
forecast is about centimeters of rain or dollars of income. It only depends on the 
probabilities of the various outcomes. So Shannon entropy is a universal, abstract 
concept. 
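
A short Python sketch can make this concrete. It takes the forecast's probabilities to be 1/4, 1/2, 1/4 (the values that give an expected rainfall of 2 centimeters) and attaches them first to centimeters of rain and then to some made-up dollar amounts. The helper names and the dollar figures are assumptions for illustration only.

```python
from math import log2

def expected_value(probs, values):
    """Probability-weighted average of the values."""
    return sum(p * v for p, v in zip(probs, values))

def shannon_entropy_bits(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

probs = [0.25, 0.5, 0.25]          # probabilities of the three outcomes
rain_cm = [1, 2, 3]                # centimeters of rain
income_dollars = [10, 500, 7000]   # made-up amounts, purely illustrative

print(expected_value(probs, rain_cm))         # 2.0
print(expected_value(probs, income_dollars))  # 2002.5
print(shannon_entropy_bits(probs))            # 1.5
```

The expected value changes when the amounts change, but the entropy function never even looks at them.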

Shannon entropy is closely connected to Gibbs entropy, which was already known 
in physics. But by lifting entropy to a more general level and connecting it to digital 
information, Shannon helped jump-start the information age. In fact a paper of his was 
the first to use the word ‘bit’! 
THE DEFINITION OF SHANNON ENTROPY


Suppose you believe there are $n$ possible outcomes with
probabilities $p_1, \dots, p_n \ge 0$ that sum to 1.


The average amount of information you learn when one of these
outcomes happens, chosen according to this probability
distribution, is the Shannon entropy:

$$H = -\sum_{i=1}^n p_i \log p_i$$


Shannon entropy is larger for probability distributions that are 
more spread out, and smaller for probability distributions that are 
more sharply peaked. 


I've been leading up to it with examples, but here it is in general: Shannon entropy! 
Gibbs had already used a similar formula in physics—but with base e for the logarithm, 
an integral instead of a sum, and multiplying the answer by Boltzmann's constant. 
Shannon applied it to digital information. 

Here's where the formula for Shannon entropy comes from. We have some set of
outcomes, say $X$. We have a probability distribution on this set, meaning a function
$p \colon X \to [0,1]$ such that

$$\sum_{i \in X} p_i = 1.$$

If we have any function $A \colon X \to \mathbb{R}$, we define its expected value to be

$$\langle A \rangle = \sum_{i \in X} p_i A_i.$$


It’s a kind of average of A where each value A(i) is ‘weighted’, i.e. multiplied, by the 
probability of the ith outcome. We saw an example in the last tweet: the expected 
amount of rainfall. 

We've seen that if you believe the $i$th outcome has probability $p_i$, the amount of
information you learn if the $i$th outcome actually occurs is $-\log p_i$. Thus, the expected
amount of information you learn is

$$\langle -\log p \rangle = -\sum_{i \in X} p_i \log p_i.$$

And this is the Shannon entropy! We denote it by $H$, or more precisely $H(p)$, so

$$H(p) = -\sum_{i \in X} p_i \log p_i.$$

In the box above I was taking $X$ to be the set $\{1, \dots, n\}$. This is often a good thing to
do when there are finitely many outcomes.
Let's get to know the Shannon entropy a little better. 
Puzzle 21. Let $X = \{1, 2\}$ so that we know a probability distribution $p$ on $X$ if we
know $p_1$, since $p_2 = 1 - p_1$. Graph the Shannon entropy $H(p)$ as a function of $p_1$. Show
that it has a maximum at $p_1 = \frac{1}{2}$ and minima at $p_1 = 0$ and $p_1 = 1$.


This makes sense: if you believe $p_1 = 1$ then you learn nothing when an outcome
happens chosen according to the probability distribution $p$: you are sure outcome 1 will
occur, and it does (with probability 1). Similarly, if you believe $p_1 = 0$ you learn nothing
when an outcome happens according to this probability distribution, since you are sure
outcome 2 will occur. On the other hand, if $p_1 = \frac{1}{2}$ you are maximally undecided about
what will happen, and you learn 1 bit of information when it does.


Puzzle 22. Let $X = \{1, 2, 3\}$. Draw the set of probability distributions on $X$ as an
equilateral triangle whose corners are the probability distributions $(1,0,0)$, $(0,1,0)$,
and $(0,0,1)$. Sketch contour lines of $H(p)$ as a function on this triangle. Show it has a
maximum at $p = (\frac{1}{3}, \frac{1}{3}, \frac{1}{3})$ and minima at the corners of the triangle.


Again this should make intuitive sense. Here is a harder puzzle along the same lines: 


Puzzle 23. Let $X = \{1, \dots, n\}$. Show that $H(p)$ has a maximum at $p = (\frac{1}{n}, \dots, \frac{1}{n})$ and
minima at the probability distributions where $p_i = 1$ for some particular $i \in X$.


Here is one of the big lessons from all this: 


Shannon entropy is larger for probability distributions that are more spread out, 
and smaller for probability distributions that are more sharply peaked. 


Indeed, you can think of Shannon entropy as a measure of how spread out a probability
distribution is! The more spread out it is, the more you learn when an outcome
occurs, drawn from that distribution.
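
As a quick numerical illustration of this lesson, the sketch below compares four distributions on 4 outcomes, ordered from sharply peaked to fully spread out. The particular distributions and their labels are invented for the example.

```python
from math import log2

def shannon_entropy_bits(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

# Four illustrative distributions on 4 outcomes, from peaked to spread out.
distributions = {
    "certain":  [1.0, 0.0, 0.0, 0.0],
    "peaked":   [0.85, 0.05, 0.05, 0.05],
    "lopsided": [0.4, 0.3, 0.2, 0.1],
    "uniform":  [0.25, 0.25, 0.25, 0.25],
}
for name, probs in distributions.items():
    print(f"{name:9s} H = {shannon_entropy_bits(probs):.4f} bits")
```

The entropy climbs from 0 bits for the certain outcome up to 2 bits, which is $\log_2 4$, for the uniform distribution.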
THE PRINCIPLE OF MAXIMUM ENTROPY


Suppose there are n possible outcomes. At first you have no 
reason to think any is more probable than any other. 


Then you learn some facts about the correct probability
distribution, but not enough to determine it uniquely! Which
probability distribution $p_1, \dots, p_n$ should you choose?


The principle of maximum entropy says: 


Of all the probability distributions consistent with the facts
you've learned, choose the one with the largest Shannon entropy

$$H = -\sum_{i=1}^n p_i \log p_i$$


What’s Shannon entropy good for? For starters, it gives a principle for choosing 
the ‘best’ probability distribution consistent with what you know. Choose the one that 
maximizes the Shannon entropy! 

This is called the ‘principle of maximum entropy’. This principle first arose in
statistical mechanics, which is the application of probability theory to physics, but we
can use it elsewhere too.

For example: suppose you have a die with faces numbered 1,2,3,4,5,6. At first you 
think it’s fair. But then you somehow learn that the average of the numbers that come
up when you roll it is 5. Given this, what’s the probability that if you roll it, a 6 comes 
up? 

Sounds like an unfair question! But you can figure out the probability distribution 
on {1,2,3,4,5,6} that maximizes Shannon entropy subject to the constraint that the 
mean is 5. According to the principle of maximum entropy, you should use this to 
answer my question! 

But is this correct? 

The problem is figuring out what ‘correct’ means! But in statistical mechanics we 
use the principle of maximum entropy all the time, and it seems to work well. The 
brilliance of E. T. Jaynes was to realize it’s a general principle of reasoning, not just 
for physics. 

The principle of maximum entropy is widely used outside physics, though still con- 
troversial. But I think we should use it to figure out some basic properties of a gas—like 
its energy or entropy per molecule, as a function of pressure and temperature. 

To do this, we should generalize Shannon entropy to ‘Gibbs entropy’, replacing the 
sum by an integral. Or else we should ‘discretize’ the gas, assuming each molecule has 
a finite set of states. It sort of depends on whether you prefer calculus or programming. 
Either approach is okay if we study our gas using classical statistical mechanics. 

Quantum statistical mechanics gives a more accurate answer. It uses a more general 
definition of entropy, but the principle of maximum entropy still applies!

I won't dive into any calculations just yet. Before doing a gas, we should do some 
simpler examples, like the die whose average roll is 5. But I can’t resist mentioning
one philosophical point. In the box above I was hinting that maximum entropy works 
when your ‘prior’ is uniform: 


Suppose there are n possible outcomes. At first you have no reason to 
think any is more probable than any other. 


This is an important assumption: when it’s not true, the principle of maximum entropy 
as we've stated it does not apply. But what if our set of events is something like a 
line? There’s no obvious best probability measure on the line! And even good old 
Lebesgue measure dx depends on our choice of coordinates. To handle this, we need a 
generalization of the principle of maximum entropy, called the principle of maximum 
relative entropy. 

In short, a deeper treatment of the principle of maximum entropy pays more atten- 
tion to our choice of ‘prior’: what we believe before we learn new facts. And it brings 
in the concept of ‘relative entropy’: entropy relative to that prior. But we won't get 
into this here, because we will always be using a very bland prior, like assuming that 
each of finitely many outcomes is equally likely. 
ADMITTING YOUR IGNORANCE


Suppose you describe your knowledge of a system with $n$ states
using a probability distribution $p_1, \dots, p_n$.
Then the Shannon information

$$H = -\sum_{i=1}^n p_i \log p_i$$

measures your ignorance of the system's state.


So, choosing the maximum-entropy probability distribution 
consistent with the facts you know 
amounts to 
not pretending to know more than you do. 


Remember: if we describe our knowledge using a probability distribution, its Shan- 
non entropy says how much we expect to learn when we find out what's really going 
on. We can roughly say it measures our 'ignorance'—though ordinary language can be 
misleading here. 


At first you think this ordinary 6-sided die is fair. But then you learn that no, the 
average of the numbers that come up is 5. What are the probabilities $p_1, \dots, p_n$ for the
different faces to come up? 

This is tricky: you can imagine different answers! 

You could guess the die lands with 5 up every time. In other words, $p_5 = 1$. This
indeed gives the correct average. But the entropy of this probability distribution is 0.
So you're claiming to have no ignorance at all of what happens when you roll the die!

Next you might guess that it lands with 4 up half the time and 6 up half the time.
In other words, $p_4 = p_6 = \frac{1}{2}$. This probability distribution has 1 bit of entropy. Now
you are admitting more ignorance. But how can you be so sure that 5 never comes up?

Next you might guess that $p_4 = p_6 = \frac{1}{4}$ and $p_5 = \frac{1}{2}$. We can compute the entropy
of this probability distribution. It's higher: 1.5 bits. Good, you're being more honest
now! But how can you be sure that 1, 2, or 3 never come up? You are still pretending
to know stuff!

Keep improving your guess, finding probability distributions with mean 5 with big- 
ger and bigger entropy. The bigger the entropy gets, the more you're admitting your 
ignorance! If you do it right, your guess will converge to the unique maximum-entropy 
solution. 
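
The guesses above are easy to check by machine. The sketch below recomputes their means and entropies, and adds one further guess that also allows a 3. That last guess is a hypothetical example invented here, not one from the text.

```python
from math import log2

def shannon_entropy_bits(probs):
    """Shannon entropy in bits, with the convention 0 log 0 = 0."""
    return sum(p * log2(1 / p) for p in probs if p > 0)

def mean_roll(probs):
    """Average roll, where probs[0] is the probability of face 1."""
    return sum(face * p for face, p in enumerate(probs, start=1))

# Each guess lists the probabilities of faces 1 through 6.
guesses = {
    "always 5":  [0, 0, 0, 0, 1, 0],
    "4 or 6":    [0, 0, 0, 0.5, 0, 0.5],
    "4, 5 or 6": [0, 0, 0, 0.25, 0.5, 0.25],
    "3 to 6 (hypothetical extra guess)": [0, 0, 0.1, 0.1, 0.5, 0.3],
}
for name, probs in guesses.items():
    print(f"{name}: mean = {mean_roll(probs):.3f}, "
          f"entropy = {shannon_entropy_bits(probs):.3f} bits")
```

Every guess has mean 5, and each one in turn admits a little more ignorance than the one before.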

But there's a more systematic way to solve this problem. 
THE BOLTZMANN DISTRIBUTION


Suppose you want to maximize the Shannon entropy

$$-\sum_{i=1}^n p_i \log p_i$$

of a probability distribution $p_1, \dots, p_n$, subject to the constraint that the
expected value of some quantity $A_i$ is some number $c$:

$$\sum_{i=1}^n p_i A_i = c \qquad (\star)$$

Then try the Boltzmann distribution:

$$p_i = \frac{\exp(-\beta A_i)}{\sum_{i=1}^n \exp(-\beta A_i)}$$

If you can find $\beta$ that makes $(\star)$ hold, this is the answer you seek!


How do you actually use the principle of maximum entropy? 

If you know the expected value of some quantity and want to maximize entropy 
given this, there's a great formula for the probability distribution that usually does 
the job! It's called the ‘Boltzmann distribution’. In physics it also goes by the names
‘Gibbs distribution’ or ‘canonical ensemble’, and in statistics it’s called an ‘exponential 
family’. 

In the Boltzmann distribution, the probability $p_i$ is proportional to $\exp(-\beta A_i)$ where
$A$ is the quantity whose expected value you know. Since probabilities must sum to one,
we must have

$$p_i = \frac{\exp(-\beta A_i)}{\sum_{i=1}^n \exp(-\beta A_i)}.$$

It is then easy to find the expected value of $A$ as a function of the number $\beta$: just plug
these probabilities into the formula

$$\langle A \rangle = \sum_{i=1}^n p_i A_i.$$

The hard part is inverting this process and finding $\beta$ if you know what you want $\langle A \rangle$
to be.

When and why does the Boltzmann distribution actually work? That's a bit of a 
long story, so I'll explain it later. First, let's use the Boltzmann distribution to solve
the puzzle I mentioned last time: 


At first you think this ordinary 6-sided die is fair. But then you learn that 
no, the average of the numbers that come up is 5. What are the probabilities 
$p_1, \dots, p_n$ for the different faces to come up? You can use the Boltzmann
distribution to solve this puzzle! 
To do it, take $1 \le i \le 6$ and $A_i = i$. Stick the Boltzmann distribution $p_i$ into the
formula $\sum_i A_i p_i = 5$ and get a polynomial equation for $\exp(-\beta)$. You can solve this
with a computer and get $\exp(-\beta) \approx 1.877$.

So, the probability of rolling the die and getting the number $1 \le i \le 6$ is proportional
to $\exp(-\beta i) \approx 1.877^i$. You can figure out the constant of proportionality
by demanding that the probabilities sum to 1, or just look at the formula for the
Boltzmann distribution. You should get these probabilities:

$$p_1 \approx 0.02053, \quad p_2 \approx 0.03854, \quad p_3 \approx 0.07232, \quad p_4 \approx 0.1357, \quad p_5 \approx 0.2548, \quad p_6 \approx 0.4781.$$

You can compute the entropy of this probability distribution, and you get roughly 1.97
bits. You'll remember that last time we got entropies up to 1.5 bits just by making
some rather silly guesses.

So, using the Boltzmann distribution, you can find the maximum-entropy die that 
rolls 5 on average. Later, we'll see how the same math lets us find the maximum-entropy 
state of a box of gas that has some expected value of energy! 
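
Here is one way the computer calculation might go, as a minimal sketch. It writes $x = \exp(-\beta)$, so that face $i$ has probability proportional to $x^i$, and finds $x$ by bisection. The function names, the bracketing interval from 1 to 10, and the choice of bisection are all assumptions made for this illustration; any root finder would do.

```python
from math import log2

FACES = range(1, 7)

def mean_roll(x):
    """Mean of the die whose face i has probability proportional to x**i."""
    weights = [x**i for i in FACES]
    return sum(i * w for i, w in zip(FACES, weights)) / sum(weights)

def solve_for_x(target_mean, lo=1.0, hi=10.0, steps=100):
    """Bisection: mean_roll(x) increases with x, so keep halving [lo, hi]
    around the point where mean_roll(x) equals target_mean."""
    for _ in range(steps):
        mid = (lo + hi) / 2
        if mean_roll(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

x = solve_for_x(5.0)                      # x plays the role of exp(-beta)
weights = [x**i for i in FACES]
probs = [w / sum(weights) for w in weights]
entropy_bits = sum(p * log2(1 / p) for p in probs)

print(f"exp(-beta) = {x:.3f}")
print("probabilities:", [round(p, 4) for p in probs])
print(f"entropy = {entropy_bits:.2f} bits")
```

Bisection works here because raising $x$ shifts weight toward the higher faces, so the mean roll increases steadily from 3.5 at $x = 1$.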
MAXIMIZATION SUBJECT TO A CONSTRAINT


To maximize a smooth function $f$ of several variables
subject to a constraint on some smooth function $g$,
look for a point where

$$\nabla f = \lambda \nabla g$$

for some number $\lambda$.

[Figure: the surface $g = \text{constant}$.]


When we're trying to maximize entropy subject to a constraint, we're doing a prob- 
lem of the above sort. If you don't know how to do problems like this, it's time to 
learn about Lagrange multipliers. You can find this in any book on calculus of several 
variables. But the idea is in the picture above. Say we've got two smooth functions 
$f, g \colon \mathbb{R}^n \to \mathbb{R}$ and we have a point on the surface $g = \text{constant}$ where $f$ is as big as
it gets on this surface. The gradient of $f$ must be perpendicular to the surface at this
point. Otherwise we could move along the surface in a way that made $f$ bigger! For
the same reason, the gradient of $g$ is always perpendicular to the surface $g = \text{constant}$.
So $\nabla f$ and $\nabla g$ must point in the same or opposite directions at this point. Thus, as
long as the gradient of $g$ is nonzero, we must have

$$\nabla f = \lambda \nabla g$$

for some number $\lambda$, called a Lagrange multiplier. So, solving this equation along with

$$g = \text{constant}$$

is a way to find the point we're looking for, though we still need to check we've found
a maximum, not a minimum or something else.

We can write a formula that means the exact same thing as $\nabla f = \lambda \nabla g$ using
differentials:

$$df = \lambda \, dg.$$


This is what we'll do from now on. Gradients are vector fields while differentials are 
1-forms. If you don't know what this means, you can probably ignore this for now: the 
difference, while ultimately quite important, will not be significant for anything we're 
doing. 
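
As a small worked example of the method, with functions chosen purely for illustration, take $f(x,y) = x + y$ and the constraint $g(x,y) = x^2 + y^2 = 1$. Then

$$df = dx + dy, \qquad dg = 2x \, dx + 2y \, dy,$$

so $df = \lambda \, dg$ says $1 = 2 \lambda x$ and $1 = 2 \lambda y$, giving $x = y = 1/(2\lambda)$. Plugging this into the constraint gives

$$\frac{2}{4 \lambda^2} = 1, \qquad \text{so} \qquad \lambda = \pm \frac{1}{\sqrt{2}}.$$

The two solutions are $(x,y) = (\tfrac{1}{\sqrt{2}}, \tfrac{1}{\sqrt{2}})$, where $f = \sqrt{2}$, and $(x,y) = (-\tfrac{1}{\sqrt{2}}, -\tfrac{1}{\sqrt{2}})$, where $f = -\sqrt{2}$. The first is the maximum and the second is the minimum, which is why a solution of the equation still has to be checked.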
MAXIMIZING ENTROPY SUBJECT TO A CONSTRAINT


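To maximize the entropy

$$H = -\sum_{i=1}^n p_i \ln p_i$$

subject to a constraint on the expected value

$$\langle A \rangle = \sum_{i=1}^n p_i A_i$$

it’s good to look for a probability distribution such that

$$dH = \lambda \, d\langle A \rangle$$

for some number $\lambda$. This will then be a
Boltzmann distribution:

$$p_i = \frac{\exp(-\lambda A_i)}{\sum_{i=1}^n \exp(-\lambda A_i)}$$

[Figure: the surface $\langle A \rangle = \text{constant}$.]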


We’ve seen how to maximize a function subject to a constraint. Now let’s do the 
case we're interested in: maximizing entropy subject to a constraint on the expected 
value of some quantity. 

Suppose we have a finite set of outcomes, say $1, \dots, n$. Our ‘quantity’ $A$ is just a
number $A_1, \dots, A_n$ depending on the outcome. For any probability distribution $p$ on
the set of outcomes, we can define its Shannon entropy and the expected value of $A$:

$$H = -\sum_{i=1}^n p_i \ln p_i, \qquad \langle A \rangle = \sum_{i=1}^n p_i A_i.$$

Here we are using base $e$ for the Shannon entropy, to simplify the calculations. Let’s
try to find the probability distribution that maximizes $H$ on the surface $\langle A \rangle = c$. We'll
show that if such a probability distribution $p$ exists, and none of the $p_i$ are zero, then
$p$ must be a Boltzmann distribution

$$p_i = \frac{\exp(-\lambda A_i)}{\sum_{i=1}^n \exp(-\lambda A_i)}$$

for some $\lambda \in \mathbb{R}$. If you're willing to trust me on this, you can skip this calculation.
To use the method from last time (the Lagrange multiplier method) we'd like to
use the probabilities $p_i$ as coordinates on the space of probability distributions. But
they aren't independent, since

$$\sum_{i=1}^n p_i = 1.$$

To get around this, let’s use all but one of the $p_i$ as coordinates, and remember that
the remaining one is a function of those. Let's use $p_2, p_3, \dots, p_n$ as coordinates, so that
$p_1 = 1 - (p_2 + \cdots + p_n)$. Furthermore, the space of all probability distributions on our
finite set is

$$\left\{ p \in \mathbb{R}^n \;\middle|\; 0 \le p_i \le 1, \; \sum_{i=1}^n p_i = 1 \right\}.$$

It looks like a closed interval when $n = 2$, or a triangle when $n = 3$, or a tetrahedron
when $n = 4$, or some higher-dimensional version of a tetrahedron when $n$ is larger. In
its interior this space looks locally like $\mathbb{R}^{n-1}$, so we can use the Lagrange multiplier
method, but it also has a boundary where one or more of the $p_i$ vanish, and then this
method no longer applies. (We'll see an example of that later.)

So, let's assume $p$ is a probability distribution maximizing the Shannon entropy $H$
on the surface $\langle A \rangle = c$, and also suppose $p$ has $p_1, \dots, p_n > 0$. Suppose that not all
the values $A_i$ are equal, since that makes the problem too easy (see why?). Then $d\langle A \rangle$ is
never zero, so from what I said last time, we must have

$$dH = \lambda \, d\langle A \rangle$$

at the point $p$. So let's see what this equation actually says.
Since

$$H = -\sum_{i=1}^n p_i \ln p_i$$

we have

$$dH = -\sum_{i=1}^n d(p_i \ln p_i) = -\sum_{i=1}^n (1 + \ln p_i) \, dp_i.$$

Similarly, since

$$\langle A \rangle = \sum_{i=1}^n p_i A_i$$

we have

$$d\langle A \rangle = \sum_{i=1}^n A_i \, dp_i.$$

So, our equation $dH = \lambda \, d\langle A \rangle$ says

$$-\sum_{i=1}^n (1 + \ln p_i) \, dp_i = \lambda \sum_{i=1}^n A_i \, dp_i.$$

For these to be equal, the coefficients of $dp_i$ must agree for each of our coordinates
$p_2, \dots, p_n$. But we have to remember that $p_1 = 1 - (p_2 + \cdots + p_n)$ and thus
$dp_1 = -(dp_2 + \cdots + dp_n)$. Thus, for each $i = 2, \dots, n$ we have

$$(1 + \ln p_1) - (1 + \ln p_i) = \lambda (A_i - A_1)$$

and fiddling around we get

$$\frac{p_i}{p_1} = \frac{\exp(-\lambda A_i)}{\exp(-\lambda A_1)}.$$
This says something cool: the probabilities $p_i$ are proportional to the exponentials
$\exp(-\lambda A_i)$. And since the probabilities must sum to 1, it's obvious what the constant
of proportionality must be:

$$p_i = \frac{\exp(-\lambda A_i)}{\sum_{i=1}^n \exp(-\lambda A_i)}.$$

So yes: $p_i$ must be given by the Boltzmann distribution!
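
A numerical spot check of this conclusion is sketched below. It is not a proof. It takes a Boltzmann distribution on six outcomes, nudges it in a direction that changes neither the total probability nor the expected value of $A$, and confirms that the entropy goes down. The values of $A$, the multiplier `lam`, and the direction `v` are all arbitrary choices made for the illustration.

```python
from math import exp, log

def boltzmann(A, lam):
    """Probabilities proportional to exp(-lam * A_i)."""
    weights = [exp(-lam * a) for a in A]
    Z = sum(weights)
    return [w / Z for w in weights]

def entropy_nats(p):
    """Shannon entropy using the natural logarithm."""
    return -sum(q * log(q) for q in p if q > 0)

A = [1, 2, 3, 4, 5, 6]   # illustrative values of the quantity A
lam = -0.4               # arbitrary illustrative multiplier
p = boltzmann(A, lam)

# This direction preserves both constraints: its entries sum to 0,
# and 1*1 + 2*(-2) + 3*1 = 0, so the expected value of A is unchanged.
v = [1, -2, 1, 0, 0, 0]

for eps in (-1e-3, 1e-3):
    q = [pi + eps * vi for pi, vi in zip(p, v)]
    print(f"eps = {eps:+.0e}   change in entropy = {entropy_nats(q) - entropy_nats(p):.3e}")
```

Both nudges lower the entropy, as expected if the Boltzmann distribution is the maximizer on the constraint surface.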

In summary, we've seen that if there exists a probability distribution $p$ that maximizes
the Shannon entropy among probability distributions with $\langle A \rangle = c$, and if all
the $p_i$ are positive, then $p$ must be a Boltzmann distribution. But this raises other
questions. When does such a probability distribution exist? If it exists, is it unique?
And what if not all the $p_i$ are positive?

In what follows we'll dive down this rabbit hole and get to the bottom of it. I'll
just state some facts; you may enjoy trying to see if you can prove them. First, there
exists a probability distribution $p_1, \dots, p_n$ with $\langle A \rangle = c$ if and only if

$$A_{\min} \le c \le A_{\max}$$

where $A_{\min}$ is the minimum value and $A_{\max}$ is the maximum value of the numbers
$A_1, \dots, A_n$. Second, whenever

$$A_{\min} \le c \le A_{\max},$$

there exists a unique probability distribution $p_1, \dots, p_n$ maximizing Shannon entropy
subject to the constraint $\langle A \rangle = c$. Third, this unique maximizer $p$ has $p_i > 0$ for all $i$,
and is thus a Boltzmann distribution, whenever

$$A_{\min} < c < A_{\max}.$$

When $c = A_{\min}$, the unique maximizer assigns the same probability $p_i$ to each outcome
$i$ with $A_i = A_{\min}$, while $p_i = 0$ for all other outcomes. Similarly, when $c = A_{\max}$, the
unique maximizer assigns the same probability $p_i$ to each outcome $i$ with $A_i = A_{\max}$,
while $p_i = 0$ for all other outcomes.
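
These facts can be explored numerically. The sketch below tabulates $\langle A \rangle$ for Boltzmann distributions over a range of multipliers, using the faces of a die as illustrative values of $A$. The function names and the sample multipliers are assumptions made for the example.

```python
from math import exp

def boltzmann(A, lam):
    """Probabilities proportional to exp(-lam * A_i).
    Shifting the exponents so the largest weight is 1 avoids overflow."""
    shift = min(lam * a for a in A)
    weights = [exp(-(lam * a - shift)) for a in A]
    Z = sum(weights)
    return [w / Z for w in weights]

def expected(A, p):
    """Expected value of A under the distribution p."""
    return sum(a * q for a, q in zip(A, p))

A = [1, 2, 3, 4, 5, 6]   # illustrative values: the faces of a die
for lam in (-20, -2, -0.5, 0, 0.5, 2, 20):
    p = boltzmann(A, lam)
    print(f"lambda = {lam:6.1f}   <A> = {expected(A, p):.9f}   smallest p_i = {min(p):.3e}")
```

As the multiplier runs from very negative to very positive, $\langle A \rangle$ slides from just below $A_{\max}$ down to just above $A_{\min}$, never reaching either end, while every $p_i$ stays positive.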

It's good to look at an example: 


Puzzle 24. Suppose there are only two outcomes, with $A_1 = -1$ and $A_2 = 1$. Work out
the Boltzmann distribution $p$ maximizing Shannon entropy subject to the constraint
$\langle A \rangle = c$ for $-1 < c < 1$. Show that as $\lambda \to +\infty$ this Boltzmann distribution has

$$p_1 \to 1, \quad p_2 \to 0$$

while as $\lambda \to -\infty$ it has

$$p_1 \to 0, \quad p_2 \to 1.$$

Show the probability distribution $p_1 = 1, p_2 = 0$ maximizes Shannon entropy subject
to the constraint $\langle A \rangle = -1$, while $p_1 = 0, p_2 = 1$ maximizes it subject to the constraint
$\langle A \rangle = 1$. Show these two probability distributions are not Boltzmann distributions.




THERMAL EQUILIBRIUM 


Suppose a system has finitely many states $i = 1, \dots, n$ with
energies $E_i$. If the probability $p_i$ that it’s in the $i$th state maximizes
entropy subject to a constraint on its expected energy:

$$\langle E \rangle = \sum_{i=1}^n p_i E_i$$

we say it is in thermal equilibrium. In this case $p_i$ is given by a
Boltzmann distribution

$$p_i = \frac{\exp(-\beta E_i)}{\sum_{i=1}^n \exp(-\beta E_i)}$$

at least if all the probabilities $p_i$ are positive.


Don't worry: the substance of what I’m saying here is almost the same as in the 
last box. I’m merely attaching new words to the concepts, to make them sound more 
like physics: 


• Before I said we had a set of $n$ ‘outcomes’ numbered $1, 2, \dots, n$. Now I’m talking
about ‘states’. If we have a system with $n$ states, it means there are $n$ outcomes
when we do a measurement to completely determine which state it’s in. A ‘state’
is some way for a physical system to be. That’s vague, but it’s all we can say until
we consider some specific kind of system. In classical physics the states form a
set, usually infinite but sometimes finite.


• Before I said we had a ‘quantity’ $A$ that depends on the outcome, taking the value
$A_i$ in the $i$th outcome. Now I'm calling this quantity the ‘energy’ $E$. Energy is
a particularly interesting quantity in physics, so we'll focus on that, without
demanding that you know anything about it: for our present purposes, we can
take any quantity and dub it ‘energy’.


• Before I called the Lagrange multiplier $\lambda$. Now I'm calling it $\beta$, because that’s
what physicists do in this particular context.


When a system maximizes entropy subject to a constraint on the expected value 
of its energy, and perhaps also some other quantities, we say the system is in thermal 
equilibrium. This is meant to suggest that an object just sitting there, not heating up 
or cooling down, is often best modeled this way. 

You may have noticed the annoying clause “at least if all the probabilities $p_i$ are
positive”. I only said that because I cannot tell a lie. In Puzzle 24 we saw that as
$\beta \to +\infty$, the Boltzmann distribution can converge to a non-Boltzmann probability
distribution where some of the probabilities $p_i$ vanish. This still counts as thermal
equilibrium, because it's still maximizing entropy subject to a constraint on expected
energy. We'll learn more about this when we study the concept of ‘absolute zero’.




COOLNESS 


If a probability distribution $p_i$ maximizes entropy subject
to a constraint on the expected value of the energy
$E_i$, then

$$p_i \propto e^{-\beta E_i}$$

where $\beta$ is the coolness, inversely proportional to temperature.
So:


The cooler a system is, the less likely it is to be in a
high-energy state!


Say a system with finitely many states maximizes entropy subject to a constraint
on the expected value of some quantity $E$ that we choose to call ‘energy’. Then its
probability of being in the $i$th state is proportional to $\exp(-\beta E_i)$ for some number $\beta$.

When $\beta$ is big and positive, the probability of being in a state of high energy is tiny,
since $\exp(-\beta E_i)$ gets very small for large energies $E_i$. This means our system is cold.

Conversely when $\beta$ is small and positive, $\exp(-\beta E_i)$ drops off very slowly as the
energy $E_i$ gets bigger. So, high-energy states become quite likely when $\beta$ is small and
positive. This means our system is hot.

It turns out $\beta$ is inversely proportional to the temperature (more about that later).
But in modern physics $\beta$ is just as important as temperature. It comes straight from
the principle of maximum entropy!

So $\beta$ deserves a name. And its name is ‘coolness’.

By the way, the formula

$$p_i \propto e^{-\beta E_i}$$

is only strictly true when $\beta$ is finite. There's also a limiting case $\beta \to +\infty$, when $p_i = 0$
except for states of the very lowest energy. And there's a limiting case $\beta \to -\infty$, where
$p_i = 0$ except for states of the very highest energy. I'll say a bit about these
oddities later. First I'll say more about what coolness has to do with temperature.
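
To get a feel for cold versus hot, the sketch below evaluates the Boltzmann distribution for a made-up four-state system at three positive values of the coolness. The energies and the values of `beta` are invented for illustration and carry no physical units.

```python
from math import exp

def boltzmann(energies, beta):
    """Probabilities proportional to exp(-beta * E_i)."""
    weights = [exp(-beta * E) for E in energies]
    Z = sum(weights)
    return [w / Z for w in weights]

energies = [0.0, 1.0, 2.0, 3.0]   # made-up energies in arbitrary units
for beta in (5.0, 1.0, 0.1):      # from cold to hot
    p = boltzmann(energies, beta)
    print(f"beta = {beta:3.1f}   p = {[round(x, 4) for x in p]}")
```

At the largest coolness nearly all the probability sits on the lowest-energy state, while at the smallest the four states are close to equally likely.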
COOLNESS VERSUS TEMPERATURE


Coolness $\beta$ is inversely proportional to temperature $T$:

$$\beta = \frac{1}{kT}$$

where $k$ is Boltzmann's constant.


Coolness is measured in joules$^{-1}$,
temperature is measured in kelvin, and
Boltzmann's constant is a conversion factor:

$$k = 1.380649 \cdot 10^{-23} \, \frac{\text{joules}}{\text{kelvin}}$$


In statistical mechanics, coolness is inversely proportional to temperature. But
coolness has units of energy$^{-1}$, not temperature$^{-1}$. So we need a constant to convert
between coolness and inverse temperature! And this constant is very interesting.

Remember: when a system maximizes entropy with a constraint on its expected
energy, the probability of it having energy $E$ is proportional to $\exp(-\beta E)$ where $\beta$ is
its coolness. But we can only exponentiate dimensionless quantities! (Why?) So $\beta$ has
dimensions of 1/energy.

Since coolness is inversely proportional to temperature, we must have $\beta = 1/kT$
where $k$ is some constant with dimensions of energy/temperature. This constant $k$ is
called ‘Boltzmann’s constant’. It's tiny:

$$k = 1.380649 \cdot 10^{-23} \text{ joules/kelvin}.$$


This is mainly because we use units of energy, joules, suited to macroscopic objects 
like a cup of hot water. Boltzmann's constant being tiny reveals that such things have 
enormously many microscopic states! 

Later we'll see that a single classical point particle, in empty space, has energy $3kT/2$
when it's maximizing entropy at temperature $T$. The 3 here is because the atom can
move in 3 directions, the 1/2 because we integrate $x^2$ to get this result. The important
part is $kT$. The $kT$ says: if an ideal gas is made of atoms, each atom contributes just a
tiny bit of energy per kelvin, or degree Celsius: roughly $10^{-23}$ joules. So a little bit of
gas, like a gram of hydrogen, must have roughly $10^{23}$ atoms in it. This is a very rough
estimate, but it's a big deal.

Indeed, the number of atoms in a gram of hydrogen is about $6 \cdot 10^{23}$. You may have
heard of Avogadro's number: this is quite close to that. So Boltzmann's constant gives
a hint that matter is made of atoms and, even better, a nice rough estimate of how
many per gram!

Later we will see that Boltzmann's constant has another important meaning: it's a
fundamental unit of entropy, a nat, expressed in joules/kelvin.
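
To put numbers on this, the sketch below converts a temperature into a coolness and evaluates $kT$ and $3kT/2$. The temperature of 300 kelvin is an assumed stand-in for room temperature, not a value from the text.

```python
k = 1.380649e-23   # Boltzmann's constant in joules/kelvin
T = 300.0          # an assumed room temperature, in kelvin

beta = 1.0 / (k * T)             # coolness, in 1/joules
energy_per_atom = 1.5 * k * T    # 3kT/2, in joules

print(f"kT    = {k * T:.3e} joules")
print(f"beta  = {beta:.3e} per joule")
print(f"3kT/2 = {energy_per_atom:.3e} joules")
```

The coolness comes out enormous in these units for the same reason that $k$ is tiny: the joule is a unit sized for cups of hot water, not single atoms.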
TEMPERATURE


If a system has finitely many states with energies $E_i$,
in thermal equilibrium at temperature $T$
the probability that it’s in the $i$th state is

$$p_i \propto \exp(-E_i / kT)$$

where $k$ is Boltzmann’s constant and
$T$ can be positive, negative, or even infinite:


A system with finitely many states can be pretty weird. It can have negative temperature!
Even weirder: as you heat it up, its temperature becomes large and positive,
then it reaches infinity, and then it ‘wraps around’ and becomes large and negative.

The reason: coolness is more fundamental than temperature. The coolness $\beta$ is
inversely proportional to the temperature $T$:

$$\beta = 1/kT.$$

When the temperature goes up to infinity and then suddenly becomes a large negative
number, it's really just the coolness going down to zero and becoming negative.
Temperatures ‘wrap around’ infinity, as shown in the picture.

A system with finitely many states can have negative or infinite temperature because
in thermal equilibrium, its probability of being in the $i$th state is

$$p_i = \frac{\exp(-\beta E_i)}{\sum_{i=1}^n \exp(-\beta E_i)}$$

where $E_i$ is the energy of the $i$th state, and this makes sense for any $\beta \in \mathbb{R}$. Moreover,
the probability $p_i$ depends continuously on $\beta$, even as $\beta$ passes through zero. This means
a large positive temperature is almost like a large negative temperature!


But the circle of temperature can be misleading. Temperatures wrap around $T = \infty$
but not $T = 0$. A system with a small positive temperature is very different from one
with a small negative temperature! That's because $p_i$ for $\beta \gg 0$ is very different than
it is for $\beta \ll 0$.
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For a system with finitely many states we can take the limit of the Boltzmann 
distribution when £ — +оо; then the system will only occupy its lowest-energy state or 
states. We can also take the limit when 2 — —oo; then the system will only occupy its 
highest-energy state or states. In terms of temperature, this means that the limit where 
Т approaches zero from above is very different than the limit where Т approaches zero 
from below. 

So, for a system with finitely many states, the best picture of possible thermal
equilibria is not a circle of temperatures but a closed interval of coolness: the coolness
$\beta$ can be anything in $[-\infty, +\infty]$, which topologically is a closed interval. In terms
of coolness, $+\infty$ is different from $-\infty$, but approaching 0 from above is the same as
approaching it from below. But in terms of temperature, approaching 0 from above is
different from approaching 0 from below, while a temperature of $+\infty$ is the same as a
temperature of $-\infty$.
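To make the ‘closed interval of coolness’ concrete, here is a minimal Python sketch that evaluates the Boltzmann distribution of a finite system as a function of $\beta$. The two-state energies and the sample values of $\beta$ are hypothetical choices for illustration, and the function name `boltzmann` is not from the text.

```python
import math

def boltzmann(energies, beta):
    """Return probabilities p_i proportional to exp(-beta * E_i)."""
    exponents = [-beta * e for e in energies]
    # Subtract the largest exponent before exponentiating, to avoid overflow.
    shift = max(exponents)
    weights = [math.exp(x - shift) for x in exponents]
    total = sum(weights)
    return [w / total for w in weights]

energies = [0.0, 1.0]  # hypothetical two-state system, arbitrary energy units

# Walk from large positive coolness, through zero, to large negative coolness.
for beta in (50.0, 1e-3, 0.0, -1e-3, -50.0):
    p = boltzmann(energies, beta)
    print(f"beta = {beta:+8.3f}   p = ({p[0]:.4f}, {p[1]:.4f})")
```

The probabilities barely change as $\beta$ crosses zero, which is the statement that $T = +\infty$ and $T = -\infty$ describe the same equilibrium, while the two ends $\beta = \pm 50$ put nearly all the probability on opposite states.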

Now, if all this seems very weird, here's why: we often describe physical systems 
using infinitely many states, with a lowest possible energy but no highest possible 
energy. In this case the sum in the Boltzmann distribution can't converge for $\beta < 0$, so
negative temperatures are ruled out. 

However, some physical systems can be approximately described using a finite set 
of states (or in quantum theory, a finite-dimensional Hilbert space of states). Then the 
things I just said hold true! And people enjoy studying these systems, and their strange 
properties, in the lab. 

It's good to look at a simple example, and work everything out explicitly: 


Puzzle 25. Suppose a system has two states with energies $E_1 < E_2$. Compute the
probabilities $p_i$ that it is in either of these states in thermal equilibrium as a function
of the coolness $\beta$. Then express these probabilities as a function of the temperature $T$.
Using these functions $p_i(T)$:


• Show that when $0 < T < +\infty$ the system is more likely to be in the lower-energy
state: $p_1(T) > p_2(T)$.


• Show that when $-\infty < T < 0$ the system is more likely to be in the higher-energy
state: $p_1(T) < p_2(T)$.


• Show that

$$\lim_{T \to +\infty} p_i(T) = \lim_{T \to -\infty} p_i(T)$$

so we can speak unambiguously of the probabilities $p_i$ at infinite temperature.


• Show that at infinite temperature the system has an equal probability of being in
either state.


• Show that as $T$ approaches zero from above, the probability of the system being
in the lower energy state approaches 1.


• Show that as $T$ approaches zero from below, the probability of the system being
in the higher energy state approaches 1.
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INFINITE TEMPERATURE 


If a system has finitely many states with energies $E_i$,
in thermal equilibrium at temperature $T$
the probability that it's in the $i$th state is

$$p_i \propto \exp(-\beta E_i)$$

where $\beta = 1/kT$ and $k$ is Boltzmann's constant.


When $\beta = 0$ the system's temperature becomes infinite,
and all states become equally probable!


The probability of finding a system in a particular state decays exponentially with
energy when the coolness $\beta$ is positive. But for a system with finitely many states, $\beta$
can be zero. Then it becomes equally probable for the system to be in any state!

Zero coolness means ‘utter randomness’: that is, maximum entropy.

Here's why. The probability distribution with the largest entropy is the one where
all probabilities $p_i$ are equal. This happens at zero coolness! When $\beta = 0$ we
get $\exp(-\beta E_i) = 1$ for all $i$. The probabilities $p_i$ are proportional to these numbers
$\exp(-\beta E_i) = 1$, so they're all equal.

It seems zero coolness is impossible for a system with infinitely many states. With 
infinitely many states, all equally probable, the probability of being in any state would 
be zero. In other words, there's no uniform probability distribution on an infinite set. 

One way out: replace sums with integrals. For the usual measure on $[0, 1]$, called
the Lebesgue measure $dx$, we have $\int_0^1 dx = 1$. So this is a ‘probability measure’ that
we could use to describe a system at zero coolness, whose space of states is $[0, 1]$.

But replacing sums by integrals raises all sorts of interesting issues. For example, 
there’s a unique way to sum over a finite set of states, but an integral over an infinite 
set of states depends on a choice of measure. So a choice of measure is a significant 
extra structure we're slapping on our set of states. 

We'll need to think about these issues later, since to compute the entropy of a
classical ideal gas we'll need integrals. But we'll encounter difficulties, which are ultimately
resolved using quantum mechanics.

Anyway: infinite temperature is really zero coolness, and at least for systems with 
finitely many states, the entropy becomes as large as possible at zero coolness. 
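As a numerical check of the claim that entropy peaks at zero coolness, the following sketch computes the Shannon entropy (in nats) of the Boltzmann distribution for a small system at several values of $\beta$. The three energies are hypothetical, and the helper names are illustrative only.

```python
import math

def boltzmann(energies, beta):
    """Return probabilities p_i proportional to exp(-beta * E_i)."""
    exponents = [-beta * e for e in energies]
    shift = max(exponents)  # guards against overflow for large |beta|
    weights = [math.exp(x - shift) for x in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def shannon_entropy(probs):
    """Shannon entropy in nats, using the convention 0 ln 0 = 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)

energies = [0.0, 1.0, 3.0]  # hypothetical three-state system

for beta in (-2.0, -0.5, 0.0, 0.5, 2.0):
    h = shannon_entropy(boltzmann(energies, beta))
    print(f"beta = {beta:+.1f}   H = {h:.4f} nats")

# The maximum possible entropy for three states, reached when all p_i = 1/3.
print(f"ln 3 = {math.log(3):.4f} nats")
```

Only the $\beta = 0$ row should reach $\ln 3$; every other row falls below it, on both the positive and the negative side.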
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NEGATIVE TEMPERATURE 


If a system has finitely many states with energies $E_i$,
in thermal equilibrium at temperature $T$
the probability that it's in the $i$th state is

$$p_i \propto \exp(-\beta E_i)$$

where $\beta = 1/kT$ and $k$ is Boltzmann's constant.


When $\beta < 0$ the system becomes ‘hotter than infinitely hot’.
Its temperature is negative, but the higher the energy of a state,
the more probable it is!


A system with finitely many states can reach infinite temperature. It can get even
hotter, but then its temperature ‘wraps around’ and becomes negative!

The possibility of negative temperatures was first discussed by the physicist Lars
Onsager in 1949, and they have been created in the lab with a variety of systems
that, within some approximation, can be described as having finitely many states. In
quantum theory, this happens for systems that have a finite basis of ‘energy eigenstates’:
states with well-defined energies $E_i$. For example, the nucleus of an atom may have just
two spin states, and if we put it in a magnetic field these will have different energies.
The result is the system we studied in Puzzle 25.

Here is a generalization with more energy states, all equally spaced: 


Puzzle 26. Consider a system with $2N + 1$ states labeled by an integer $n$ with $-N \le
n \le N$, where the $n$th state has energy $E_n = an$ for some energy $a > 0$. Compute
the Boltzmann distribution for this system at coolness $\beta$ for all $\beta \in \mathbb{R}$. Compute the
expected energy $\langle E \rangle$ as a function of $\beta$. What is the qualitative difference in your result
between the case of positive temperature ($\beta > 0$) and negative temperature ($\beta < 0$)?


For more, try this: 


• Wikipedia, Negative temperature. 
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ABSOLUTE ZERO: THE LIMIT OF INFINITE COOLNESS 


If a system with finitely many states having energies $E_i$ is in thermal
equilibrium, the probability $p_i$ that it's in the $i$th state is
proportional to $\exp(-\beta E_i)$ where $\beta$ is the coolness.


In the limit of infinite coolness, $\beta \to +\infty$, these probabilities go to
zero except for the states of lowest energy, which all become equally
probable.


The limit $\beta \to +\infty$ is also the limit where $T$ approaches zero from
above, commonly called absolute zero.


The limit where $T$ approaches zero from above is often called absolute zero. Why?
First people made up various temperature scales like Celsius, where zero was the
freezing point of water, and Fahrenheit, where zero is the freezing point of a mixture of
water, ice, and ammonium chloride. But researchers discovered that nature had a more
fundamental concept of zero temperature: the limit of infinite coolness! This happens
as the temperature approaches −273.15 °C, or roughly −459.67 °F. This discovery led
Kelvin to propose a shifted version of Celsius where zero is absolute zero. This was
originally called ‘absolute Celsius’, but now it is called the Kelvin scale. This is the
scale of temperature I'll always use here. The size of the degrees is a somewhat arbitrary
convention, but the zero is not: it's absolute zero.
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THE HAGEDORN TEMPERATURE 


If a system has a countable infinity of states $n = 1, 2, 3, \dots$
with energies $E_n$, the Boltzmann distribution

$$p_n = \frac{\exp(-E_n/kT)}{\sum_{n=1}^{\infty} \exp(-E_n/kT)}$$

is either:


1) defined for all $0 < T < +\infty$
2) undefined for all $0 < T < +\infty$
3) defined for all $0 < T < T_H$ but not for $T_H < T < +\infty$, where
$T_H$ is some temperature called the Hagedorn temperature.


We've been discussing systems with finitely many states, but many physical systems
have a countable infinity of states. So let's think a bit about those. We can copy
everything we've done so far, but we have to be careful. For thermal equilibrium to be
possible at some temperature $T$, we need the Boltzmann distribution

$$p_n = \frac{\exp(-E_n/kT)}{\sum_{n=1}^{\infty} \exp(-E_n/kT)}$$

to make sense. But it might not. Sometimes the sum fails to converge! This happens
when the terms $\exp(-E_n/kT)$ don't go to zero fast enough as $n \to +\infty$.
Let's investigate this issue. We'll assume that

$$\sum_{n=1}^{\infty} \exp(-E_n/kT)$$

converges for some $T > 0$. Then the energies $E_n$ must be bounded below: otherwise
the terms $\exp(-E_n/kT)$ will get bigger and bigger. Furthermore for any $E \in \mathbb{R}$ there
can be at most finitely many $E_n$ less than $E$: otherwise we'd be adding up infinitely
many terms greater than $\exp(-E/kT)$. As a result, we can reorder the states so their
energies are nondecreasing:

$$E_1 \le E_2 \le E_3 \le \cdots$$

and $E_n \to +\infty$.

Reordering a sum can’t change its convergence or value if it’s a sum of nonnegative 
numbers, like the sum we have here. So we might as well assume we've listed the 
energies in nondecreasing order as above. Then there are two cases: 


1. The energies $E_n$ approach $+\infty$ so fast that $\sum_{n=1}^{\infty} \exp(-E_n/kT)$ converges for all
$0 < T < +\infty$. Then our system can be in thermal equilibrium at any finite positive
temperature. This is the nicest situation, and the one we typically expect.

2. The energies $E_n$ approach $+\infty$ slowly enough that $\sum_{n=1}^{\infty} \exp(-E_n/kT)$ converges
when $T$ is small enough, but not otherwise. In this case there is some temperature
$T_H$, called the Hagedorn temperature, such that our system can be in thermal
equilibrium at positive temperatures $T$ below $T_H$, but not above $T_H$.


In both cases, $\sum_{n=1}^{\infty} \exp(-E_n/kT)$ diverges for all $-\infty < T < 0$ and $T = \pm\infty$. So,
for a system with a countable infinity of states, if thermal equilibrium exists for some
positive temperature, it cannot exist for negative or infinite temperatures.

The second case is weird and interesting. It's named after Rolf Hagedorn, who in
1964 noticed that this was a possibility in nuclear physics. He considered a model of
nuclear matter where the energies $E_n$ approach $+\infty$ in a roughly logarithmic way. As
you heat it, its expected energy keeps increasing, but its temperature can never exceed
$T_H$. This model turned out to be incorrect, but it's interesting anyway.
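To see how a logarithmic spectrum can produce a Hagedorn temperature, here is a small Python sketch. It assumes, purely for illustration, the exactly logarithmic energies $E_n = E \ln n$, so that each term $\exp(-E_n/kT)$ equals $n^{-s}$ with $s = E/kT$. The function name `partial_sum` and the sample values of $s$ are hypothetical.

```python
def partial_sum(s, n_terms):
    """Partial sum of exp(-E_n/kT) = n**(-s) for E_n = E ln n, where s = E/kT."""
    return sum(n ** (-s) for n in range(1, n_terms + 1))

# s = 2.0 is a low temperature (kT = E/2); s = 0.5 is a high one (kT = 2E).
for s in (2.0, 0.5):
    sums = [partial_sum(s, 10 ** k) for k in (3, 4, 5)]
    formatted = ", ".join(f"{x:.4f}" for x in sums)
    print(f"s = E/kT = {s}: partial sums up to 10^3, 10^4, 10^5 = {formatted}")
```

For $s = 2$ the partial sums settle down, so the Boltzmann distribution can be normalized. For $s = 0.5$ they keep growing as more states are included, so no normalization exists. For this assumed spectrum the sum is the series $\sum n^{-s}$, which converges exactly when $s > 1$, so the boundary between the two behaviors sits at $kT = E$.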

Now let's solve some puzzles on systems with a countable infinity of states. Some of 
these show up in quantum mechanics, but you don't need to know quantum mechanics 
to do these puzzles. 


Puzzle 27. Show that for a system with a countable infinity of states, if thermal
equilibrium is possible for some negative temperature, it is impossible for positive or infinite
temperatures.


Puzzle 28. Work out the Boltzmann distribution when $E_n = nE$ for some energy $E$,
and show that it is well-defined for all temperatures $0 < T < +\infty$.


The next puzzle is a lot like the previous one—a bit more messy, but worthwhile 
because of its great importance in physics. 


Puzzle 29. For a system called the quantum harmonic oscillator of frequency $\omega$ we have
$E_n = (n + \tfrac{1}{2})\hbar\omega$, where $\hbar$ is the reduced Planck's constant. Work out the Boltzmann
distribution in this case, and show it is well-defined for all temperatures $0 < T < +\infty$.


Puzzle 30. For a system called the primon gas we have $E_n = E \ln n$ for some energy
$E$. Show that the Boltzmann distribution is well-defined for small enough positive
temperatures, but there is a Hagedorn temperature. Give a formula for the Boltzmann
distribution in terms of the Riemann zeta function:

$$\zeta(s) = \sum_{n=1}^{\infty} n^{-s}.$$


You can show that for the primon gas the sum $\sum_{n=1}^{\infty} \exp(-E_n/kT)$ diverges at the
Hagedorn temperature. But it can go the other way, too:


Puzzle 31. Find energies $E_n$ with a Hagedorn temperature such that $\sum_{n=1}^{\infty} \exp(-E_n/kT)$
converges at the Hagedorn temperature.


Various other strange things can happen, as you should expect when dealing with 
infinite series. For example, it's possible that the Boltzmann distribution is well-defined 
at some temperature but the expected value of the energy is infinite! But I'll resist the 
lure of these rabbit holes and turn to something much more important: systems with 
a continuum of states. We will need to get good at these to compute the entropy of 
hydrogen. Now our sums become integrals, and various new things happen. 
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THE FINITE VERSUS THE CONTINUOUS 


| THE FINITE | THE CONTINUOUS |
|---|---|
| $p$ a probability distribution on $\{1, \dots, n\}$ | $p$ a probability distribution on $\mathbb{R}$ |
| Gibbs entropy $S(p) = -k \sum_{i=1}^{n} p_i \ln p_i$ | Gibbs entropy $S(p) = -k \int_{-\infty}^{\infty} p(x) \ln p(x)\,dx$ |
| $S(p)$ always $\ge 0$ | $S(p)$ not always $\ge 0$ |
| $S(p)$ always finite | $S(p)$ not always finite |
| $S(p)$ invariant under permutations of $\{1, \dots, n\}$ | $S(p)$ not invariant under reparametrizations of $\mathbb{R}$ |


You can switch from finite sums to integrals in the definition of entropy, and we'll 
need to do this to compute the entropy of hydrogen. But be careful: a bunch of things 
change! 

We need to switch from finite sums to integrals when we switch from a finite set of
states to a measure space of states. I'll illustrate the ideas with the real line, $\mathbb{R}$. We
define a probability distribution on $\mathbb{R}$ to be an integrable function $p\colon \mathbb{R} \to [0, \infty)$ with

$$\int_{-\infty}^{\infty} p(x)\,dx = 1.$$

Such a probability distribution has a Gibbs entropy given by

$$S(p) = -k \int_{-\infty}^{\infty} p(x) \ln p(x)\,dx.$$

We can also define Shannon entropy, where we leave out Boltzmann's constant $k$ and
use whatever base we want for the logarithm:

$$H(p) = -\int_{-\infty}^{\infty} p(x) \log p(x)\,dx.$$

I should warn you that many writers reserve the term ‘Shannon entropy’ only for a sum

$$H(p) = -\sum_{i \in X} p_i \log p_i.$$

While that convention has advantages, I want to use the term ‘Shannon entropy’ to
signal that I'm leaving out the factor of $k$.

Unlike the entropy for a probability distribution on a finite set, the entropy of a
probability distribution on $\mathbb{R}$ can be negative! This is disturbing. Earlier I said that
the Shannon entropy of a probability distribution is the expected amount of information
you learn when an outcome is chosen according to that distribution. How can this be
negative?

The answer is that this interpretation of entropy, valid for probability distributions 
on a finite or even a countably infinite set, is not true in the continuous case! We have 
to adapt our intuitions. 

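Look at an example. Let $p_\epsilon$ be the probability distribution on $\mathbb{R}$ given by

$$p_\epsilon(x) = \begin{cases} \dfrac{1}{\epsilon} & \text{if } 0 \le x \le \epsilon \\ 0 & \text{otherwise.} \end{cases}$$

For small $\epsilon$ it's a tall thin spike near 0. Let's work out its Shannon entropy:

$$\begin{aligned}
H(p_\epsilon) &= -\int_{-\infty}^{\infty} p_\epsilon(x) \log p_\epsilon(x)\,dx \\
&= -\int_0^\epsilon \frac{1}{\epsilon} \log \frac{1}{\epsilon}\,dx \\
&= \log \epsilon.
\end{aligned}$$

We're just integrating a constant here, so it's easy. When $\epsilon = 1$ the entropy is zero,
and when $\epsilon$ becomes smaller than 1 the entropy becomes negative!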
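The spike calculation can be checked numerically. The sketch below estimates $-\int p \ln p\,dx$ with a midpoint rule, using natural logarithms so the result is in nats. The helper names `differential_entropy` and `spike`, the step count, and the sample widths are all illustrative assumptions.

```python
import math

def differential_entropy(density, a, b, n_steps=100_000):
    """Midpoint-rule estimate of -∫_a^b p(x) ln p(x) dx, in nats."""
    dx = (b - a) / n_steps
    total = 0.0
    for i in range(n_steps):
        p = density(a + (i + 0.5) * dx)
        if p > 0:  # convention: 0 ln 0 = 0
            total -= p * math.log(p) * dx
    return total

def spike(eps):
    """Uniform density of height 1/eps on the interval [0, eps]."""
    return lambda x: 1.0 / eps if 0.0 <= x <= eps else 0.0

for eps in (10.0, 1.0, 0.1, 0.001):
    h = differential_entropy(spike(eps), 0.0, eps)
    print(f"eps = {eps:<6}  H = {h:+.4f}   ln(eps) = {math.log(eps):+.4f}")
```

The two printed columns should agree, and the sign flips from positive to negative as the width drops below 1.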

Why? We need a distance scale to define the entropy of a probability distribution on
the real line. If I measure distance in centimeters, I'll think the entropy of a probability
distribution is bigger than you do if you measure it in meters. And if I measure it in
kilometers, I'll think the entropy is smaller, and possibly even negative.

Let's see how this works. If I measure distance in different units from you, my
coordinate $y$ on the real line will not equal your coordinate $x$: instead we'll have

$$y = cx$$

for some $c > 0$. Then my probability distribution, say $q$, will have

$$\int_{-\infty}^{\infty} q(y)\,dy = \int_{-\infty}^{\infty} q(cx)\,d(cx) = c \int_{-\infty}^{\infty} q(cx)\,dx$$

so we must have

$$q(cx) = c^{-1} p(x)$$

to make this integral equal 1. In other words, stretching out a probability distribution
must also flatten it out, making it less ‘tall’, and its entropy increases. Indeed:


Puzzle 32. Show that $H(q) = H(p) + \ln c$.


Thanks to this formula, choosing $0 < c < 1$ compresses a probability distribution
and makes it taller, reducing its entropy. Inevitably, this can make the entropy negative
if $c$ is small enough.
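Here is a numerical illustration of how a change of units shifts the entropy. It assumes, as an example not taken from the text, that $p$ is a standard Gaussian, builds the rescaled density $q(y) = p(y/c)/c$, and compares the two entropies in nats. The integration window of 12 standard deviations and the helper names are arbitrary choices.

```python
import math

def differential_entropy(density, a, b, n_steps=100_000):
    """Midpoint-rule estimate of -∫_a^b p(x) ln p(x) dx, in nats."""
    dx = (b - a) / n_steps
    total = 0.0
    for i in range(n_steps):
        p = density(a + (i + 0.5) * dx)
        if p > 0:  # convention: 0 ln 0 = 0
            total -= p * math.log(p) * dx
    return total

def gaussian(x):
    """Standard Gaussian density, used here as an example choice of p."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def rescaled(density, c):
    """Density q in the coordinate y = c x, so that q(c x) = p(x) / c."""
    return lambda y: density(y / c) / c

h_p = differential_entropy(gaussian, -12.0, 12.0)
for c in (100.0, 1.0, 0.01):
    h_q = differential_entropy(rescaled(gaussian, c), -12.0 * c, 12.0 * c)
    print(f"c = {c:<6}  H(q) = {h_q:+.4f}   H(p) + ln c = {h_p + math.log(c):+.4f}")
```

With $c = 100$ (meters to centimeters) the entropy goes up by $\ln 100$; with $c = 0.01$ it goes down by the same amount and becomes negative for this example.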

In summary: in the continuous case, entropy is not invariant under
reparametrizations: our choice of coordinates matters! And this can make entropy negative. This
applies not only to $\mathbb{R}$ but many other measure spaces we'll be considering, like $\mathbb{R}^n$. This
issue will be very important.

After learning this, it should be less of a shock that the entropy of a probability
distribution on $\mathbb{R}$ can be infinite, or even undefined:


Puzzle 33. Find three probability distributions $p$ on the real line that have entropy
$+\infty$, $-\infty$, and undefined because it's of the form $+\infty - \infty$.
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ENTROPY, ENERGY AND TEMPERATURE 


Suppose a system has some measure space $X$ of states with
energy $E\colon X \to \mathbb{R}$. In thermal equilibrium the probability
distribution on states, $p\colon X \to \mathbb{R}$, maximizes the Gibbs entropy

$$S = -k \int_X p(x) \ln p(x)\,dx$$

subject to a constraint on the expected value of energy:

$$\langle E \rangle = \int_X p(x) E(x)\,dx$$

Typically when this happens $p$ is the Boltzmann distribution

$$p(x) = \frac{e^{-E(x)/kT}}{\int_X e^{-E(x)/kT}\,dx}$$

where $T$ is the temperature and $k$ is Boltzmann's constant.


Then as we vary $\langle E \rangle$ we have

$$d\langle E \rangle = T\,dS$$


We can now generalize a lot of our work from a finite set of states to a general 
measure space. I won’t redo all the arguments, just state the results and point out a 
couple of caveats. 

For any measure space $X$ we say a function $p\colon X \to [0, \infty)$ is a probability
distribution if it's measurable and

$$\int_X p(x)\,dx = 1.$$

We can define a version of Shannon entropy for $p$ by

$$H = -\int_X p(x) \log p(x)\,dx,$$

but physicists mainly use the Gibbs entropy, defined by

$$S = -k \int_X p(x) \ln p(x)\,dx.$$


As I warned you last time, this can take values in $[-\infty, \infty]$, though we are mainly
interested in cases when it's finite. If we think of $X$ as the space of states of some
system, we can pick any measurable function $E\colon X \to \mathbb{R}$ and call it the ‘energy’. Its
expected value is then

$$\langle E \rangle = \int_X E(x) p(x)\,dx$$

at least when this integral converges.
We say the probability distribution $p$ describes thermal equilibrium if it maximizes
$S$ subject to a constraint $\langle E \rangle = c$. Typically when this happens $p$ is a Boltzmann
distribution

$$p(x) = \frac{e^{-\beta E(x)}}{\int_X e^{-\beta E(x)}\,dx}$$

where $\beta$ is called the coolness. I say ‘typically’ because even when $X$ is a finite set,
we saw in Puzzle 24 that there can be thermal equilibria that are not Boltzmann
distributions, but only limits of Boltzmann distributions as $\beta \to +\infty$ or $\beta \to -\infty$.
This can also happen for other measure spaces $X$. I will not delve into this, because
my goal now is to get to some physics.

As before, we can write $\beta = 1/kT$, at least if $\beta \ne 0$, and then write the Boltzmann
distribution as

$$p(x) = \frac{e^{-E(x)/kT}}{\int_X e^{-E(x)/kT}\,dx}.$$

Also as before, the Boltzmann distributions obey the crucial relation

$$dH = \beta\,d\langle E \rangle.$$

Rewriting this in terms of Gibbs entropy $S = kH$ and temperature $T = 1/k\beta$, it
becomes this famous relation between temperature, entropy and the expected energy:

$$T\,dS = d\langle E \rangle.$$

Notice that the units match here. The Shannon entropy $H$ is dimensionless, but
since $k$ has units of energy/temperature, the Gibbs entropy $S = kH$ has units of
energy/temperature. Thus $T\,dS$ has units of energy, as does $d\langle E \rangle$.
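The relation $dH = \beta\,d\langle E \rangle$ can be tested with finite differences. The sketch below does this for a finite set of states (a measure space with counting measure), with $H$ in nats. The four energies, the value of $\beta$ and the step size `h` are hypothetical choices for illustration.

```python
import math

def boltzmann(energies, beta):
    """Return probabilities p_i proportional to exp(-beta * E_i)."""
    exponents = [-beta * e for e in energies]
    shift = max(exponents)  # guards against overflow
    weights = [math.exp(x - shift) for x in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def entropy_nats(beta, energies):
    """Shannon entropy H (in nats) of the Boltzmann distribution at coolness beta."""
    return -sum(p * math.log(p) for p in boltzmann(energies, beta) if p > 0)

def expected_energy(beta, energies):
    """Expected energy <E> of the Boltzmann distribution at coolness beta."""
    return sum(p * e for p, e in zip(boltzmann(energies, beta), energies))

energies = [0.0, 1.0, 2.5, 4.0]  # hypothetical energies
beta, h = 0.7, 1e-5              # hypothetical coolness and finite-difference step

# Small changes in H and <E> as beta moves from beta - h to beta + h.
dH = entropy_nats(beta + h, energies) - entropy_nats(beta - h, energies)
dE = expected_energy(beta + h, energies) - expected_energy(beta - h, energies)
print(f"dH / d<E> = {dH / dE:.6f}   beta = {beta}")
```

The ratio of the two small changes should reproduce $\beta$ to within the finite-difference error.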
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THE CHANGE IN ENTROPY 


As we change the temperature of a system from $T_1$ to $T_2$
while keeping it in thermal equilibrium, the change in its
entropy is

$$S(T_2) - S(T_1) = \int_{T_1}^{T_2} \frac{d\langle E \rangle}{T}$$

where $\langle E \rangle$ is its expected energy at temperature $T$.


Last time we saw that as we change the expected energy $\langle E \rangle$ of a system while
keeping it in thermal equilibrium, this fundamental relation holds:

$$T\,dS = d\langle E \rangle.$$

We can rewrite this as

$$dS = \frac{d\langle E \rangle}{T}$$

and then integrate this from one temperature to another (remember, as the expected
energy changes, so does the temperature). We get

$$\int_{T_1}^{T_2} \frac{d\langle E \rangle}{T} = S(T_2) - S(T_1).$$

This is the main way people do experiments to ‘measure entropy’. Slowly heat
something up, keeping track of how much energy it takes to increase its temperature
each little bit. Using this data you can approximately calculate the integral at left, and
that gives the change in entropy!

But so far we're just measuring changes in entropy. How can you figure out the
actual value of the entropy? One way is to assume the Third Law of Thermodynamics,
which says that in thermal equilibrium the entropy approaches zero as the temperature
approaches zero from above. This gives

$$\int_0^{T_1} \frac{d\langle E \rangle}{T} = S(T_1).$$

This is how people often ‘measure the entropy’ of a system in thermal equilibrium. They
heat it up starting from absolute zero, very slowly so that (they hope) it is close to thermal
equilibrium at every moment, and they take data on how much energy is used, and
approximately calculate the integral at left!
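The heating procedure can be imitated in a few lines of Python. This sketch assumes a hypothetical two-state system with energies 0 and 1 and a single lowest-energy state, works in units where $k = 1$, and starts the integral at a small temperature `T_start` instead of exactly zero, where the entropy is already negligible. It compares the accumulated integral of $d\langle E \rangle / T$ with the entropy computed directly from the probabilities.

```python
import math

def expected_energy(T, gap=1.0):
    """<E> for a two-state system with energies 0 and gap, in units where k = 1."""
    w = math.exp(-gap / T)
    return gap * w / (1.0 + w)

def entropy_direct(T, gap=1.0):
    """Gibbs entropy (with k = 1) computed directly from the probabilities."""
    w = math.exp(-gap / T)
    probs = [1.0 / (1.0 + w), w / (1.0 + w)]
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_by_heating(T1, T_start=0.02, n_steps=20_000):
    """Approximate the integral of d<E>/T from T_start up to T1 in small steps."""
    dT = (T1 - T_start) / n_steps
    total = 0.0
    for j in range(n_steps):
        T_lo = T_start + j * dT
        T_hi = T_lo + dT
        # Energy absorbed in this step, divided by the temperature at mid-step.
        d_energy = expected_energy(T_hi) - expected_energy(T_lo)
        total += d_energy / (0.5 * (T_lo + T_hi))
    return total

T1 = 2.0
print(f"by heating: {entropy_by_heating(T1):.6f}   direct: {entropy_direct(T1):.6f}")
```

The two numbers should agree closely, which is the ‘experimental’ route to entropy reproducing the value obtained from the probability distribution itself.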

But this relies on the Third Law of Thermodynamics. So where does that come
from?
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THE THIRD LAW OF THERMODYNAMICS 


If a system has countably many states,
with just one of lowest energy,
and thermal equilibrium is possible for this system
for some temperature $T > 0$,
then its entropy in thermal equilibrium approaches zero
as $T$ approaches zero from above:

$$\lim_{T \to 0^+} S(T) = 0$$


Some people say the Third Law of Thermodynamics this way: “entropy is zero at
absolute zero”. But it's not really that simple: indeed, other people say it's impossible
to reach absolute zero. Above I've stated a version of the Third Law that's actually a
theorem. Let's prove it!

Actually, let's prove it now for systems with only finitely many states. It'll be easier
to handle systems with a countably infinite number of states later, when we've developed
more tools. And by the way, we'll see the Third Law isn't always true for systems with
a continuum of states. It will fail for all three of the problems on our big to-do list:
the classical harmonic oscillator, the classical particle in a box and the classical ideal
gas. This is often taken as a failure of classical mechanics, since switching to quantum
mechanics makes the Third Law hold for these systems.

Let's show that for a system with finitely many states $i = 1, \dots, n$ with energies
$E_i$, as the temperature $T$ approaches zero from above, the entropy of the system in
thermal equilibrium approaches $k \ln N$ where $N$ is the number of lowest-energy states.
In thermal equilibrium

$$p_i \propto e^{-E_i/kT}.$$

Thus, all states with the lowest energy have the same probability, while as the
temperature approaches zero from above, any higher-energy states have $p_i \to 0$. So, as the
temperature approaches zero from above, the probability of the system being in any
one of its $N$ lowest-energy states approaches $1/N$, and we get

$$\lim_{T \to 0^+} S(T) = -k \sum_{i=1}^{N} \frac{1}{N} \ln\left(\frac{1}{N}\right) = k \ln N.$$

In particular, if the system has just one lowest-energy state, we get the Third Law of
Thermodynamics:

$$\lim_{T \to 0^+} S(T) = 0.$$

Here $T \to 0^+$ means that $T$ is approaching zero from above.
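The limit $k \ln N$ can be watched numerically. The sketch below works in units where $k = 1$ and uses two hypothetical spectra: one with three lowest-energy states and one with a single lowest-energy state. The function name `gibbs_entropy` and the sample temperatures are illustrative.

```python
import math

def gibbs_entropy(energies, T):
    """Entropy of the Boltzmann distribution at temperature T > 0, with k = 1."""
    exponents = [-e / T for e in energies]
    shift = max(exponents)  # guards against overflow at low temperature
    weights = [math.exp(x - shift) for x in exponents]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)

degenerate = [0.0, 0.0, 0.0, 1.0, 2.0]  # hypothetical: N = 3 lowest-energy states
unique = [0.0, 1.0, 2.0]                # hypothetical: a single lowest-energy state

for T in (1.0, 0.1, 0.01):
    s_deg = gibbs_entropy(degenerate, T)
    s_uni = gibbs_entropy(unique, T)
    print(f"T = {T:<5}  S(degenerate) = {s_deg:.6f}   S(unique) = {s_uni:.6f}")

print(f"ln 3 = {math.log(3):.6f}")
```

As $T$ shrinks, the first column should approach $\ln 3$ and the second should approach zero.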

But beware: for systems with lots of lowest-energy states, their entropy in thermal 
equilibrium can be large even near absolute zero! Also, a related problem: systems 
may take a ridiculously long time to reach equilibrium! This is especially true for 
systems that have many states whose energies are very near the lowest energy, like a 
piece of glass. You can put a piece of glass in a fancy refrigerator and try to cool it
to near absolute zero. If it has one lowest-energy state, its entropy should approach
zero. If this happened, the glass would change from glass to a crystal, which has less 
entropy. But nobody has seen glass turn into a crystal when they cool it down. If this 
happens, it probably does so only after an absurdly long time, much longer than the 
age of the Universe. This phenomenon is called ‘frustration’. People like to argue about 
frustration and the Third Law, so I am not trying to give you the final word here, just 
alert you to the issue. You can learn a bit more here: 


• Wikipedia, Third law of thermodynamics. 


By the way: for systems with finitely many states, it's possible to have negative 
temperatures, and the Third Law has a counterpart saying what happens when the 
temperature approaches zero from below: 


Puzzle 34. Show that for a system with finitely many states,

$$\lim_{T \to 0^-} S(T) = k \ln M$$

where $M$ is the number of states of highest energy.


However, most systems we'll be studying won't have a state of highest energy. 
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MEASURING ENTROPY 


If we assume the entropy of a system approaches zero as $T$
approaches zero from above, we have

$$\int_0^{T_1} \frac{d\langle E \rangle}{T} = S(T_1)$$

Using this assumption, we can do experiments to measure
the entropy of different substances
at standard temperature and pressure:


• iron: ~5 bits per atom
• water: ~12 bits per molecule
• hydrogen: ~23 bits per molecule


People actually do experiments and use the above formula to figure out the entropy
of many substances in thermal equilibrium assuming their entropy vanishes as the
temperature approaches absolute zero. They slowly heat up a substance and keep track of
how much energy is needed to raise its temperature as they go, so they can
approximately calculate the integral shown. They usually report the answers in joules/kelvin
per mole, but I enjoy ‘bits per molecule’.

As you can see, liquids tend to have more entropy than solids, and gases tend to 
have even more. My goal in this course is to teach you how to approximately compute 
some of these entropies from first principles. Unfortunately the only substances that 
are simple enough for us to handle are gases. 

This is a good opportunity to explain some conventions. A mole is defined to
be exactly $6.02214076 \cdot 10^{23}$; this is called Avogadro's number, and it's close to the
number of hydrogen atoms in a gram. A joule/kelvin of Gibbs entropy corresponds to
about $7.242971 \cdot 10^{22}$ nats of Shannon entropy: the number here is the reciprocal of
Boltzmann's constant, which is defined to be exactly $1.380649 \cdot 10^{-23}$ joules per kelvin.
A bit is $\ln 2$ nats. From these three facts, we see 1 joule/kelvin of Gibbs entropy per
mole corresponds to about 0.173516 bits/molecule of Shannon entropy.
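The unit conversion in this paragraph is easy to reproduce. The following sketch uses the two defined constants quoted above; the function name `bits_per_molecule` is an illustrative choice.

```python
import math

AVOGADRO = 6.02214076e23   # particles per mole, exact by definition
BOLTZMANN = 1.380649e-23   # joules per kelvin, exact by definition

def bits_per_molecule(molar_entropy):
    """Convert Gibbs entropy in J/(K mol) to Shannon entropy in bits per molecule."""
    per_molecule = molar_entropy / AVOGADRO  # J/K per molecule
    nats = per_molecule / BOLTZMANN          # dimensionless Shannon entropy in nats
    return nats / math.log(2)                # a bit is ln 2 nats

print(f"1 J/(K mol) = {bits_per_molecule(1.0):.6f} bits per molecule")

# Going the other way: the molar entropy matching 23 bits per molecule.
print(f"23 bits per molecule = {23 / bits_per_molecule(1.0):.1f} J/(K mol)")
```

Dividing a figure in bits per molecule by the conversion factor, as in the last line, gives the joules/kelvin per mole that an experimental table would report.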

By the way, what is ‘standard temperature and pressure’? Annoyingly, this phrase 
means different things to different organizations. I will try to always use it to mean 
a temperature of 298.15 K and a pressure of 1 bar. The temperature here equals 25 
°C, which seems a bit random compared to 0 °C—but convenient, because it’s close to 
room temperature. A pressure of 1 bar, or more officially 100 kilopascals, is slightly less 
than a ‘standard atmosphere’, which is a unit of pressure intended to equal the average 
air pressure at sea level. A pascal is an official SI unit: it’s a pressure of one newton 
per square meter. 
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THE EQUIPARTITION THEOREM 


Suppose the energy of a system with $n$ degrees of freedom is a
positive definite quadratic form $E\colon \mathbb{R}^n \to \mathbb{R}$, for example

$$E(x) = \sum_{i=1}^{n} \frac{c_i x_i^2}{2}, \qquad c_i > 0$$

Then in thermal equilibrium at temperature $T$,
the expected value of the energy is

$$\langle E \rangle = \frac{1}{2} n k T$$

where $k$ is Boltzmann's constant.


Temperature is very different from energy. But sometimes (not very often, but
sometimes) the expected energy of a system in thermal equilibrium is proportional to
its temperature. The equipartition theorem says this happens when the energy depends
quadratically on several real variables, defining a positive definite quadratic form on
$\mathbb{R}^n$. For example, it happens for a classical harmonic oscillator.

Some people get confused and try to apply the equipartition theorem where it 
doesn’t apply. They foolishly conclude that temperature is always proportional to 
energy. 

This theorem does not apply to quantum systems. Indeed, when people tried to 
apply the equipartition theorem to a mirrored box of light they ran into a problem 
called the ultraviolet catastrophe. Classically the box of light is a system where the 
energy is a positive definite quadratic form, but $n = \infty$, so they got an infinite expected
value of the energy! Quantum mechanics saves the day and makes the answer finite. 


[Figure: The "Ultraviolet Catastrophe". Radiation versus wavelength of radiation in nm at 5000 K, with curves labeled "Rayleigh-Jeans Law" and "Formula".]


The equipartition theorem also doesn't apply to classical systems unless the energy 
is quadratic. So it's very limited in its applicability, but still useful. 

Let's prove this result! To prove a theorem, you have to understand the definitions. 
We'll start with some background. 
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THE EQUIPARTITION THEOREM—BACKGROUND 


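Suppose the energy of a system with $n$ degrees of freedom is some
function

$$E\colon \mathbb{R}^n \to \mathbb{R}$$

Let $k$ be Boltzmann's constant.


Suppose $p\colon \mathbb{R}^n \to \mathbb{R}$ is a probability distribution maximizing the
entropy

$$S = -k \int_{\mathbb{R}^n} p(x) \ln p(x)\,d^n x$$

subject to a constraint on the expected energy

$$\langle E \rangle = \int_{\mathbb{R}^n} E(x) p(x)\,d^n x$$

Then $p$ must be a Boltzmann distribution:

$$p(x) = \frac{e^{-\beta E(x)}}{\int_{\mathbb{R}^n} e^{-\beta E(x)}\,d^n x}$$

for some number $\beta > 0$.


The temperature $T$ is defined so that $\beta = 1/kT$.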


We're defining entropy with an integral now, rather than a sum as before, and sticking
Boltzmann's constant into the definition of entropy, as physicists do, so that entropy
has units of energy over temperature.

Given the formula for the energy $E$ as a function on $\mathbb{R}^n$, we'll have to find the
Boltzmann distribution and then compute $\langle E \rangle$ as a function of $T$.
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PROOF OF THE EQUIPARTITION THEOREM: 1 


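Special case: a system with 1 degree of freedom where the energy $E\colon \mathbb{R} \to \mathbb{R}$
is $E(x) = x^2/2$.
The Boltzmann distribution is

$$p(x) = \frac{e^{-\beta E(x)}}{\int_{-\infty}^{\infty} e^{-\beta E(x)}\,dx} = \frac{e^{-\beta x^2/2}}{\int_{-\infty}^{\infty} e^{-\beta x^2/2}\,dx}$$

so the expected energy is

$$\langle E \rangle = \int_{-\infty}^{\infty} E(x) p(x)\,dx = \frac{\int_{-\infty}^{\infty} \frac{1}{2} x^2 e^{-\beta x^2/2}\,dx}{\int_{-\infty}^{\infty} e^{-\beta x^2/2}\,dx}.$$

We have

$$\int_{-\infty}^{\infty} e^{-u^2/2}\,du = \int_{-\infty}^{\infty} u^2 e^{-u^2/2}\,du = \sqrt{2\pi}$$

so doing a substitution with $u^2 = \beta x^2$:

$$\langle E \rangle = \frac{1}{2\beta} = \frac{1}{2} kT.$$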


We'll do two special cases before proving the general result. First let's do a system
with 1 degree of freedom where the energy is $E(x) = x^2/2$. In this case, after a change
of variables, the Boltzmann distribution becomes a Gaussian with mean 0 and variance 1,
and that gives the desired result. Or just do the integrals and see what you get! The
expected energy $\langle E \rangle$ is $\frac{1}{2} kT$.
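For readers who prefer to ‘just do the integrals’, here is a numerical version. It estimates the ratio of the two integrals with a midpoint rule and compares it with $1/2\beta$. The optional stiffness parameter `c` (for an energy $c x^2/2$), the integration window and the step count are assumptions added for illustration; the text's special case is `c = 1`.

```python
import math

def expected_energy_1d(beta, c=1.0, n_steps=100_000):
    """Midpoint-rule estimate of <E> for E(x) = c x^2 / 2 at coolness beta > 0."""
    # Beyond 12 standard deviations the Gaussian weight is negligible.
    half_width = 12.0 / math.sqrt(beta * c)
    dx = 2.0 * half_width / n_steps
    numerator = 0.0    # approximates the integral of E(x) exp(-beta E(x))
    denominator = 0.0  # approximates the integral of exp(-beta E(x))
    for i in range(n_steps):
        x = -half_width + (i + 0.5) * dx
        energy = 0.5 * c * x * x
        weight = math.exp(-beta * energy)
        numerator += energy * weight * dx
        denominator += weight * dx
    return numerator / denominator

for beta in (0.5, 1.0, 2.0):
    estimate = expected_energy_1d(beta)
    print(f"beta = {beta}:  <E> = {estimate:.6f}   1/(2 beta) = {1 / (2 * beta):.6f}")
```

Trying other positive values of `c` should leave the answer unchanged, which hints at why the coefficients $c_i$ drop out of the general theorem.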
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PROOF OF THE EQUIPARTITION THEOREM: 2 


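More general case: a system with $n$ degrees of freedom where the energy
$E\colon \mathbb{R}^n \to \mathbb{R}$ is

$$E(x) = \frac{1}{2} |x|^2 = \sum_{i=1}^{n} \frac{x_i^2}{2}$$

We can reduce this to the case with 1 degree of freedom:

$$\langle E \rangle = \frac{\int_{\mathbb{R}^n} \frac{1}{2} |x|^2 e^{-\beta |x|^2/2}\,d^n x}{\int_{\mathbb{R}^n} e^{-\beta |x|^2/2}\,d^n x} = \sum_{i=1}^{n} \frac{\int_{\mathbb{R}^n} \frac{1}{2} x_i^2 e^{-\beta |x|^2/2}\,d^n x}{\int_{\mathbb{R}^n} e^{-\beta |x|^2/2}\,d^n x}$$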


Next we do a system with $n$ degrees of freedom where the energy is a sum of $n$ terms
of the form $x_i^2/2$. It's no surprise that each degree of freedom contributes $\frac{1}{2} kT$ to the
expected energy, giving

$$\langle E \rangle = \frac{1}{2} n k T$$

But make sure you follow my calculation above. I skipped a couple of steps!
PROOF OF THE EQUIPARTITION THEOREM: 3


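General case: a system with $n$ degrees of freedom where
the energy $E\colon \mathbb{R}^n \to \mathbb{R}$ is any positive definite quadratic form. Then

$$E(x) = \frac{1}{2} |Ax|^2$$

for some invertible $n \times n$ matrix $A$. When $A$ is a diagonal matrix this gives

$$E(x) = \sum_{i=1}^{n} \frac{c_i x_i^2}{2}, \qquad c_i > 0$$

We can reduce the general case to the previous case by a change of variables
$y = Ax$:

$$\langle E \rangle = \frac{\int_{\mathbb{R}^n} \frac{1}{2} |Ax|^2 e^{-\beta |Ax|^2/2}\,d^n x}{\int_{\mathbb{R}^n} e^{-\beta |Ax|^2/2}\,d^n x} = \frac{\int_{\mathbb{R}^n} \frac{1}{2} |y|^2 e^{-\beta |y|^2/2}\,d^n y}{\int_{\mathbb{R}^n} e^{-\beta |y|^2/2}\,d^n y}$$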


Finally let's do the general case. A quadratic form on $\mathbb{R}^n$ is a map $Q\colon \mathbb{R}^n \to \mathbb{R}$
such that

$$ Q(x) = \sum_{i,j=1}^n q_{ij} x_i x_j $$

for some numbers $q_{ij} \in \mathbb{R}$. We say it's positive definite if

$$ x \ne 0 \implies Q(x) > 0. $$

One can prove that a quadratic form $Q\colon \mathbb{R}^n \to \mathbb{R}$ is positive definite if and only if

$$ Q(x) = \frac{1}{2} |Ax|^2 $$

for some invertible $n \times n$ matrix $A$. The factor of 1/2 here is just to make our calculations
easier.

Thanks to this, if we have a system whose space of states is $\mathbb{R}^n$ and its energy
function $E\colon \mathbb{R}^n \to \mathbb{R}$ is a positive definite quadratic form, we can compute

$$ \langle E \rangle = \frac{\int E(x) \exp(-\beta E(x)) \, dx}{\int \exp(-\beta E(x)) \, dx} $$

by reducing it to the previous case using a change of variables. We get

$$ \langle E \rangle = \frac{1}{2} n k T. $$

So, each degree of freedom still contributes $\frac{1}{2}kT$ to the expected energy. That's the
equipartition theorem!
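
Here is a rough numerical check of the general case for $n = 2$. It is only a sketch under stated assumptions: units with $k = 1$, a midpoint rule on a finite square, and a matrix `A` that I picked arbitrarily for illustration. The function name `expected_energy_2d` is likewise hypothetical.

```python
import math

def expected_energy_2d(A, beta, half_width=8.0, steps=400):
    """Midpoint-rule estimate of <E> for E(x) = |Ax|^2 / 2 on R^2."""
    h = 2.0 * half_width / steps
    numerator = 0.0
    denominator = 0.0
    for i in range(steps):
        x1 = -half_width + (i + 0.5) * h
        for j in range(steps):
            x2 = -half_width + (j + 0.5) * h
            # y = A x
            y1 = A[0][0] * x1 + A[0][1] * x2
            y2 = A[1][0] * x1 + A[1][1] * x2
            energy = 0.5 * (y1 * y1 + y2 * y2)
            weight = math.exp(-beta * energy)
            numerator += energy * weight
            denominator += weight
    # The cell area h*h appears in both sums and cancels in the ratio.
    return numerator / denominator

A = [[2.0, 1.0], [0.0, 3.0]]  # an arbitrary invertible matrix, chosen for illustration
beta = 0.7
estimate = expected_energy_2d(A, beta)
print(estimate, 2 / (2 * beta))  # n/(2 beta) with n = 2
```

The answer doesn't depend on the entries of `A` at all, only on the number of degrees of freedom. Try changing the matrix (keeping it invertible) and see for yourself.
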

But be careful. The equipartition theorem doesn't apply when the energy is an 
arbitrary function of n variables. It also fails when we switch from classical to quantum 
statistical mechanics. 

People sometimes memorize the conclusion of the equipartition theorem, $\langle E \rangle = \frac{1}{2} n k T$,
without learning that it holds only for classical systems whose energy is a positive
definite quadratic form. These people sometimes get fooled into thinking $\langle E \rangle$ is always
proportional to $T$. Some of these poor benighted souls go around saying that
temperature is just a measure of energy per degree of freedom. This completely ignores the
subtlety of the concept of temperature.

As we've seen, the truly general relation between temperature and energy, for
systems in thermal equilibrium, also involves entropy:

$$ T \, dS = d\langle E \rangle. $$
THE AVERAGE ENERGY OF AN ATOM


Since an atom of helium gas can move in 3 directions, and its
energy depends quadratically on its velocity and not on position,
the equipartition theorem says that classically its expected energy
should be

$$ \langle E \rangle = \frac{3}{2} kT $$

where $T$ is temperature and $k$ is Boltzmann's constant, about
$1.38 \cdot 10^{-23}$ joules/kelvin.

So, heating an atom of helium gas 1 °C should take

$$ \frac{3}{2} \cdot 1.38 \cdot 10^{-23} \text{ joules} = 2.07 \cdot 10^{-23} \text{ joules} $$

This is very close to the truth.


We can finally start reaping the rewards of all our thoughts about entropy! The 
equipartition theorem lets us estimate how much energy it takes to heat up one atom 
of helium one degree Celsius. And it works! 

Of course we don't heat up an individual atom: we heat up a bunch. A mole of
helium is about $6.02 \cdot 10^{23}$ atoms, so heating up a mole of helium one degree Celsius (=
1 kelvin) should take about

$$ 6.02 \cdot 10^{23} \times 2.07 \cdot 10^{-23} \text{ joules} = 12.46 \text{ joules} $$

And this is very close to correct! It seems the experimentally measured answer is 12.6
joules.
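
The arithmetic above is easy to redo by machine. This little sketch uses the same rounded constants as the text; the variable names are mine, and the "measured" figure is simply the 12.6 joules quoted above, not an independent measurement.

```python
k = 1.38e-23        # Boltzmann's constant in joules/kelvin, rounded as in the text
avogadro = 6.02e23  # atoms per mole, rounded as in the text

per_atom = 1.5 * k              # (3/2) k: joules to heat one helium atom by 1 kelvin
per_mole = avogadro * per_atom  # joules to heat one mole of helium by 1 kelvin
measured = 12.6                 # joules, the experimental value quoted in the text

relative_error = abs(per_mole - measured) / measured
print(per_atom, per_mole, relative_error)
```

With these inputs the equipartition estimate lands within about one percent of the quoted experimental value.
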

What are the sources of error? Most importantly, our calculation neglects the 
interaction between helium atoms. Luckily this is very small at standard temperature 
and pressure. We’re also neglecting quantum mechanics. Luckily for helium this too 
gives only small corrections at standard temperature and pressure. 

It’s important here that helium is a monatomic gas. In hydrogen, which is a diatomic 
gas, we get extra energy because this molecule can tumble around, not just move along. 
We'll try that next. 
THE ENERGY OF HYDROGEN


If we treat a molecule of hydrogen as a dumbbell whose position
takes 3 numbers and whose axis takes 2 numbers to
describe, we can try to use the equipartition theorem to estimate
its expected energy as

$$ \langle E \rangle = \frac{5}{2} kT $$

where $T$ is temperature and $k = 1.38 \cdot 10^{-23}$ joules/kelvin.

In this approximation, heating a molecule of hydrogen gas
1 kelvin takes

$$ \frac{5}{2} \cdot 1.38 \cdot 10^{-23} \text{ joules} = 3.45 \cdot 10^{-23} \text{ joules} $$

In reality it takes $3.39 \cdot 10^{-23}$ joules
at standard temperature and pressure. Not bad!


A molecule of hydrogen gas is a blurry quantum thing, but let’s pretend it’s a 
classical solid dumbbell that can move and tumble but not spin around its axis. Then 
it has 3+2 = 5 degrees of freedom, and we can use the equipartition theorem to estimate 
its energy. 

For $T$ significantly less than 6000 kelvin, hydrogen molecules don't vibrate with the
two atoms moving toward and away from each other. They don't spin around their axis
until even higher temperatures. But they tumble like a dumbbell as soon as $T$ exceeds
about 90 kelvin.

We need quantum mechanics to compute these things. But at room temperature
and pressure, we can pretend a hydrogen gas is made of classical solid dumbbells that
can move around and tumble but not spin around their axes! In this approximation
the equipartition theorem tells us $\langle E \rangle = \frac{5}{2} kT$.
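
To see how good the dumbbell approximation is, we can count degrees of freedom and compare with the measured figure quoted above. This is just a sketch using the text's rounded value of $k$; the variable names are my own.

```python
k = 1.38e-23                # Boltzmann's constant in joules/kelvin, rounded as in the text
degrees_of_freedom = 3 + 2  # 3 for position, 2 for the direction of the axis

estimate = 0.5 * degrees_of_freedom * k  # (5/2) k: joules per molecule per kelvin
measured = 3.39e-23                      # joules, the value quoted in the text

relative_error = (estimate - measured) / measured
print(estimate, relative_error)
```

The classical estimate overshoots the quoted value by a bit under 2 percent.
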

This is fine as far as it goes—but our goal in this course is to compute the entropy 
of hydrogen. We'll start with a useful warmup: the classical harmonic oscillator. 
ENTROPY OF THE HARMONIC OSCILLATOR: 1


A classical harmonic oscillator has energy

$$ E = \frac{p^2}{2m} + \frac{\kappa q^2}{2} $$

where $q$ is its position, $p$ its momentum, $m$ its mass and $\kappa$ its spring constant.

By the equipartition theorem, in thermal equilibrium at temperature $T$
it has expected energy $\langle E \rangle = kT$ where $k$ is Boltzmann's constant.

So, using $d\langle E \rangle = T \, dS$, its entropy is

$$ S = \int dS = \int \frac{d\langle E \rangle}{T} = \int \frac{k \, dT}{T} = k(\ln T + C) $$

Since this does not approach 0 as $T \to 0$ from above, the Third Law of
Thermodynamics doesn't hold for the classical harmonic oscillator.

But what is this constant $C$?
For that we must think harder.


What's the entropy of a classical harmonic oscillator in thermal equilibrium? Using
the equipartition theorem and the formula $d\langle E \rangle = T \, dS$, we can show it's

$$ S = k(\ln T + C). $$

So, the entropy grows logarithmically with temperature. And it does not go to zero as
$T$ approaches zero: instead, it goes to negative infinity. So the Third Law of
Thermodynamics does not hold for the classical harmonic oscillator!

That may seem shocking, but it actually makes sense. The Third Law holds only for
certain special systems. Furthermore, we've seen that the entropy of a sharply peaked
probability distribution on a continuous state space is negative. We'll see that the
Boltzmann distribution for the classical oscillator gets more and more sharply peaked
near $q = p = 0$ as the temperature approaches zero from above. So in fact, it makes
perfect sense that the entropy approaches $-\infty$.

However, the classical harmonic oscillator is just an approximation to the quantum 
harmonic oscillator, which does obey the Third Law. It's a good approximation at high 
temperatures, but bad at low temperatures. In fact, this business of negative entropies 
at low temperature is not something that happens in the real world. It's just a defect 
of classical mechanics. It's trying to tell us that quantum mechanics is better. 

Another point: you'll have noticed that constant $C$ here. What is it? We can make
progress with a bit of dimensional analysis. The quantity $\ln T$ is a funny thing: if we
change our units of temperature, it doesn't get multiplied by a constant factor, the way
physical quantities usually do. It changes by adding a constant! So $k \ln T$ doesn't have
dimensions of entropy. But

$$ S = k(\ln T + C) $$

must have dimensions of entropy. The constant $C$ must somehow save the day! How
does that work? Let's see.
ENTROPY OF THE HARMONIC OSCILLATOR: 2


Classically, a harmonic oscillator at temperature $T$ has entropy

$$ S = k(\ln T + C) $$

Writing $C = -\ln T_0$ for some constant $T_0$, this gives

$$ S = k \ln(T/T_0) $$

Dimensional analysis implies $T_0$ must have units of temperature!

But what is this temperature $T_0$? For that we must think harder.


The formula $S = k(\ln T + C)$ is a bit scary from the viewpoint of dimensional
analysis. We usually avoid working with the logarithm of a dimensionful quantity, like
$\ln T$, because it transforms in a funny way when we change our units. But if we write
$C = -\ln T_0$ then we get $S = k \ln(T/T_0)$, and we see the solution to our problem! If
$T_0$ has units of temperature, then $T/T_0$ is dimensionless, so $\ln(T/T_0)$ doesn't change
at all when we change our units. In other words: now $\ln(T/T_0)$ is dimensionless, so
$S = k \ln(T/T_0)$ has units of entropy as it should.

So, the constant $C$ must equal $-\ln T_0$ for some temperature $T_0$ that we can compute
for any harmonic oscillator. What is it? We could just compute it. But it's more fun
to use dimensional analysis.

What could it depend on? Obviously the mass $m$, the spring constant $\kappa$ and
Boltzmann's constant $k$. But there's no way to form a quantity with units of temperature
from just $m$, $\kappa$ and $k$. So we need an extra ingredient.

And we've seen this already: to define the entropy of a probability distribution on
the $q,p$ plane and get something with the right units, we need a quantity with units of
action! The obvious candidate is Planck's constant $\hbar$, and this is actually right. The
entropy we're after is given by an integral with respect to $dp \, dq/h$ where $\hbar = h/2\pi$. So
the temperature $T_0$ can depend on $\hbar$ as well as $m$, $\kappa$ and $k$.

We can compute a quantity with units of temperature from $m$, $\kappa$, $k$ and $\hbar$. The
frequency of our oscillator is $\omega = \sqrt{\kappa/m}$, and it's a famous fact that $\hbar\omega$ has units of
energy. $k$ has units of energy/temperature... so $\hbar\omega/k$ has units of temperature!

Thus, our temperature $T_0$ must be $\hbar\omega/k$ times some dimensionless purely
mathematical constant, which I'll call $1/\alpha$. The constant $\alpha$ must be something like $\pi$ or 2, or if we're really
unlucky, $e^{996}$, though in practice we usually get numbers fairly close to 1, not huge or
tiny numbers.

So, the entropy of a classical harmonic oscillator must be

$$ S = k \ln(T/T_0) = k \ln(\alpha kT/\hbar\omega). $$

This is as far as I can get without breaking down and doing some real work. Later we will
compute $\alpha$.
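
To get a feel for the temperature scale $\hbar\omega/k$, here is a small sketch in SI units. The constants are the standard SI values of $\hbar$ and $k$; the angular frequency `omega` is a made-up example value, not something from the text, and the function name is mine. The last few lines illustrate the point about units: rescaling both temperatures leaves $\ln(T/T_0)$ alone, while $\ln T$ by itself shifts by a constant.

```python
import math

hbar = 1.054571817e-34  # reduced Planck constant in joule seconds
k = 1.380649e-23        # Boltzmann's constant in joules/kelvin

def characteristic_temperature(omega):
    """The combination hbar * omega / k, which has units of temperature."""
    return hbar * omega / k

omega = 1.0e13  # hypothetical angular frequency in radians/second
T0_scale = characteristic_temperature(omega)
print(T0_scale)  # in kelvin

# Changing temperature units by a constant factor (for example 1.8 degrees
# Rankine per kelvin) rescales T and T0 alike.
T = 300.0
scale = 1.8
ratio_log_before = math.log(T / T0_scale)
ratio_log_after = math.log((scale * T) / (scale * T0_scale))
shift_in_log_T = math.log(scale * T) - math.log(T)
print(ratio_log_before, ratio_log_after, shift_in_log_T)
```

For this made-up frequency the scale comes out at a few tens of kelvin.
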
ENTROPY OF THE HARMONIC OSCILLATOR: 3


We've seen a classical harmonic oscillator
with frequency $\omega$ has entropy

$$ S = k \ln(\alpha kT/\hbar\omega) $$

when it's in thermal equilibrium at temperature $T$.

Here $k$ is Boltzmann's constant,
$\hbar$ is Planck's constant,
and $\alpha$ is some dimensionless mathematical constant.
We'll figure it out later.


Even though we don't know $\alpha$, this formula is already very interesting! $kT$ is known
to be the typical energy scale of thermal fluctuations at temperature $T$. $\hbar\omega$ is the
spacing between energy levels of a quantum harmonic oscillator with frequency $\omega$. The
ratio $kT/\hbar\omega$ is therefore roughly the number of energy eigenstates in which we may
find a quantum harmonic oscillator with high probability when it's at temperature $T$.

Thus, $S$ is roughly $k$ times the logarithm of the number of states that we're likely
to find a quantum harmonic oscillator in, when it's at temperature $T$. This may seem
mysterious. After all, we weren't trying to do quantum mechanics, much less count
quantum states. But there are other examples of this pattern, where the entropy of
a classical system in thermal equilibrium at temperature $T$ is roughly $k$ times the
logarithm of the number of quantum states that are accessible at temperature $T$.

We'll learn a bit more about this later, when we relate entropy to something called
the 'partition function', which can be understood as the 'number of accessible states'.
This viewpoint will also explain the constant $\alpha$. But now let's calculate this constant.
ENTROPY OF THE HARMONIC OSCILLATOR: 4


A classical harmonic oscillator has energy

$$ E(p,q) = \frac{p^2}{2m} + \frac{\kappa q^2}{2} $$

where $p$ is its momentum, $q$ its position, $m$ its mass and $\kappa$ its
spring constant.

At temperature $T$, the probability density of its momentum and
position is the Boltzmann distribution:

$$ \rho(p,q) = \frac{e^{-\beta E(p,q)}}{\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-\beta E(p,q)} \, \frac{dp \, dq}{h}} $$

where $\beta = 1/kT$, $k$ is Boltzmann's constant,
and $h = 2\pi\hbar$ is the original 'unreduced' Planck's constant.

The oscillator's entropy at temperature $T$ is thus

$$ S = -k \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \rho(p,q) \ln \rho(p,q) \, \frac{dp \, dq}{h} $$


Last time we found a formula for the entropy of a classical harmonic oscillator...
which includes a mysterious purely mathematical dimensionless constant $\alpha$. Now let's
figure out $\alpha$.

We'll grit our teeth and actually do the integral needed to calculate the entropy, but
only in one easy case! Combining this with our formula, we'll get $\alpha$.

First, recall the basics. The energy $E(p,q)$ of our harmonic oscillator at momentum
$p$ and position $q$ determines its Boltzmann distribution at temperature $T$, which I'll
call $\rho(p,q)$ now since the letter $p$ is already being used. Integrating $-\rho \ln \rho$ using
the measure $dp \, dq/h$ we get the Shannon entropy. In physics we multiply this by
Boltzmann's constant $k$ to get the Gibbs entropy.




ENTROPY OF THE HARMONIC OSCILLATOR: 5 


We can choose units of length, time, mass and temperature
to make a classical harmonic oscillator's mass $m$, its spring
constant $\kappa$, Boltzmann's constant $k$ and Planck's constant $\hbar$ all
equal 1.

Then at $T = 1$ the Boltzmann distribution of the oscillator is

$$ \rho(p,q) = \frac{e^{-(p^2+q^2)/2}}{\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(p^2+q^2)/2} \, \frac{dp \, dq}{2\pi}} = e^{-(p^2+q^2)/2} $$

so its entropy is

$$ S = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(p^2+q^2)/2} \ln\left(e^{-(p^2+q^2)/2}\right) \frac{dp \, dq}{2\pi} $$

Let's do this integral!


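Let's compute the Boltzmann distribution $\rho(p,q)$ and the entropy $S$. To keep the
formulas clean, we'll work in units where $m = \kappa = k = \hbar = 1$, and compute everything
at one special temperature: $T = 1$.

In this setup $h = 2\pi$, and

$$ e^{-\beta E(p,q)} = e^{-(p^2+q^2)/2} $$

is a beautiful Gaussian with integral

$$ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(p^2+q^2)/2} \, dp \, dq = 2\pi. $$

These two factors of $2\pi$ cancel when we compute the denominator of the probability
distribution $\rho(p,q)$:

$$ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(p^2+q^2)/2} \, \frac{dp \, dq}{2\pi} = \frac{2\pi}{2\pi} = 1. $$

Thus, we get simply

$$ \rho(p,q) = e^{-(p^2+q^2)/2}. $$

The entropy of the harmonic oscillator is thus

$$ S = -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(p^2+q^2)/2} \ln\left(e^{-(p^2+q^2)/2}\right) \frac{dp \, dq}{2\pi} $$

when $m = \kappa = k = \hbar = T = 1$. Next let's do this integral.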
ENTROPY OF THE HARMONIC OSCILLATOR: 6


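When $m = \kappa = k = \hbar = T = 1$
the entropy of a classical harmonic oscillator is

$$ \begin{aligned}
S &= -\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(p^2+q^2)/2} \ln\left(e^{-(p^2+q^2)/2}\right) \frac{dp \, dq}{2\pi} \\
  &= \frac{1}{2\pi} \int_0^{2\pi} \int_0^{\infty} \frac{r^2}{2} e^{-r^2/2} \, r \, dr \, d\theta && \text{(switching to polar)} \\
  &= \int_0^{\infty} \frac{r^2}{2} e^{-r^2/2} \, r \, dr && \text{(doing the $\theta$ integral)} \\
  &= \int_0^{\infty} u e^{-u} \, du && \text{(substituting $u = r^2/2$)}
\end{aligned} $$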


Now let's do the integral to compute the entropy of the harmonic oscillator. We
copy a famous trick for computing the integral of a Gaussian. First we switch to polar
coordinates in the $pq$ plane, where

$$ r^2 = p^2 + q^2 \quad \text{and} \quad dp \, dq = r \, dr \, d\theta. $$

Then we integrate with respect to $\theta$, which cancels out the factor of $1/2\pi$. Then we do
a substitution $u = r^2/2$. But for us $r^2/2$ is minus the logarithm of the Gaussian:

$$ \frac{r^2}{2} = -\ln\left(e^{-r^2/2}\right) $$

so we're left with

$$ S = \int_0^{\infty} u e^{-u} \, du $$

which we can do with integration by parts.
After all this work, we get an incredibly simple answer:

$$ S = 1. $$

So in the special case where $m = \kappa = k = \hbar = T = 1$, the entropy of a classical
harmonic oscillator in thermal equilibrium is 1.
Now let's return to the general case, and finish the job.
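
If you don't trust the integration by parts, here is a sketch that estimates the same entropy numerically in two ways: from the radial integral after switching to polar coordinates, and straight from the double integral over the $pq$ plane. It assumes the units above ($m = \kappa = k = \hbar = T = 1$) and uses a simple midpoint rule with cutoffs I chose by hand; the function names are illustrative.

```python
import math

def entropy_polar(r_max=40.0, steps=200_000):
    """Midpoint estimate of the radial integral of (r^2/2) exp(-r^2/2) r dr."""
    dr = r_max / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * dr
        total += (r * r / 2) * math.exp(-r * r / 2) * r * dr
    return total

def entropy_cartesian(half_width=10.0, steps=400):
    """Midpoint estimate of the double integral of -rho ln(rho) dp dq / (2 pi)."""
    h = 2 * half_width / steps
    total = 0.0
    for i in range(steps):
        p = -half_width + (i + 0.5) * h
        for j in range(steps):
            q = -half_width + (j + 0.5) * h
            s = (p * p + q * q) / 2    # s = -ln(rho) for the Gaussian rho
            total += s * math.exp(-s)  # -rho ln(rho)
    return total * h * h / (2 * math.pi)

S_polar = entropy_polar()
S_cartesian = entropy_cartesian()
print(S_polar, S_cartesian)
```

Both numbers should come out very close to 1, matching the answer above.
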
ENTROPY OF THE HARMONIC OSCILLATOR: 7


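A classical harmonic oscillator with frequency $\omega$ has entropy

$$ S = k \ln(\alpha kT/\hbar\omega) $$

for some dimensionless constant $\alpha$.

But when $m = \kappa = k = \hbar = T = 1$
we have $\omega = 1$ and $S = 1$, so we must have

$$ \alpha = e $$

and thus finally

$$ S = k\left(\ln\left(\frac{kT}{\hbar\omega}\right) + 1\right) $$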

Knowing the entropy in one special case, we can figure out the constant $\alpha$ in our
general formula for the entropy. Our general formula says

$$ S = k \ln(\alpha kT/\hbar\omega). $$

But when $m = \kappa = k = \hbar = T = 1$ we get $\omega = \sqrt{\kappa/m} = 1$, and we saw last time that in this
special case we get

$$ S = 1. $$

So $\alpha$ must equal $e$.
Thus, the entropy of an oscillator with frequency $\omega$ at temperature $T$ is

$$ S = k \ln(ekT/\hbar\omega) = k\left(\ln\left(\frac{kT}{\hbar\omega}\right) + 1\right). $$

The extra 1 here is fascinating to me. If we had slacked off, ignored the possibility of
a dimensionless constant $\alpha$, and crudely used dimensional analysis to guess $S$
approximately the way people often do, we might have gotten

$$ S = k \ln(kT/\hbar\omega) $$

This would be off by 1 nat.

What does the 1 extra nat mean? It seems pretty mysterious now. But later we'll
understand it! I already mentioned that often entropy is roughly $k$ times the logarithm
of something called the 'number of accessible states'. But that formula is not exactly
right: there's also an extra term related to energy, and that accounts for the 1 extra
nat here. Be patient, and you'll see what I mean.
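
Here is the final formula as a small function, so you can poke at it. This is a sketch: the function name and the default units $k = \hbar = 1$ are my assumptions. The checks below confirm the special case $S = 1$, show the entropy going negative at low temperature, and verify numerically that $T \, dS/dT = k$, which is what $d\langle E \rangle = T \, dS$ demands when $\langle E \rangle = kT$.

```python
import math

def oscillator_entropy(T, omega, k=1.0, hbar=1.0):
    """S = k (ln(kT / (hbar omega)) + 1) for a classical harmonic oscillator."""
    return k * (math.log(k * T / (hbar * omega)) + 1.0)

special_case = oscillator_entropy(1.0, 1.0)  # all constants equal to 1
cold = oscillator_entropy(0.1, 1.0)          # kT well below hbar * omega / e

# Central-difference estimate of T dS/dT at T = 3, which should equal k = 1.
T, step = 3.0, 1e-5
slope = (oscillator_entropy(T + step, 1.0) - oscillator_entropy(T - step, 1.0)) / (2 * step)
print(special_case, cold, T * slope)
```

Notice that the entropy is negative once $kT < \hbar\omega/e$: that's the defect of classical mechanics at low temperatures showing up.
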
WHERE ARE WE NOW?


The mystery: why does each molecule of hydrogen have ~ 23 bits
of entropy at standard temperature and pressure?

The goal: derive and understand the formula for the entropy of a
classical ideal monatomic gas:

$$ S = kN\left(\frac{3}{2} \ln kT + \ln \frac{V}{N} + \gamma\right) $$

including the mysterious constant $\gamma$.

The subgoal: compute the entropy of a single classical particle in a
1-dimensional box.

The sub-subgoal: explain entropy from the ground up, and
compute the entropy of a classical harmonic oscillator:

$$ S = k\left(\ln\left(\frac{kT}{\hbar\omega}\right) + 1\right) $$


Okay, so we've gotten somewhere! By doing the right integral, we've figured out that
the entropy $S$ of a classical harmonic oscillator of frequency $\omega$ in thermal equilibrium
at temperature $T$ is

$$ S = k\left(\ln\left(\frac{kT}{\hbar\omega}\right) + 1\right) $$

where $k$ is Boltzmann's constant and $\hbar$ is Planck's constant.

We could compute the entropy of a single particle in a box the same way, and also 
the entropy of a classical ideal diatomic gas. But the integrals get a bit hairy, so people 
prefer to use a clever trick called the ‘partition function’. It’s definitely worth learning. 
It's not merely a clever trick, it gives new insights on the relation between entropy, 
energy and temperature. So let's talk about it. 
THE PARTITION FUNCTION


If a system has a set of states $X$
with measure $dx$ and
its energy is a function $E\colon X \to \mathbb{R}$,
its partition function is

$$ Z(\beta) = \int_X e^{-\beta E(x)} \, dx $$

where $\beta$ is the coolness.


I want to compute the entropy of a particle in a box, and ultimately the entropy 
of a box of hydrogen. We could do it directly, but that’s a bit ugly. It’s better to 
use the ‘partition function’. This amazing function knows everything about statistical 
mechanics. From it you can get the entropy—and much more! 




THE PARTITION FUNCTION AND THE BOLTZMANN 
DISTRIBUTION 


If a system has a set of states $X$ with measure $dx$
and its energy is $E\colon X \to \mathbb{R}$,
in thermal equilibrium at coolness $\beta$ its probability distribution of
states is the Boltzmann distribution:

$$ p(x) = \frac{e^{-\beta E(x)}}{\int_X e^{-\beta E(x)} \, dx} = \frac{e^{-\beta E(x)}}{Z(\beta)} $$

where

$$ Z(\beta) = \int_X e^{-\beta E(x)} \, dx $$

is its partition function.


In fact we've already seen the partition function: it's the thing you have to divide
$e^{-\beta E(x)}$ by to get a function whose integral is 1. And that function whose integral
is 1 is the Boltzmann distribution: the probability distribution of states in thermal
equilibrium at coolness $\beta$. So the partition function is a humble normalizing factor!
And yet we'll see that it's incredibly powerful. It's kind of surprising.

Like Lagrangians in classical mechanics, it’s fairly easy to use partition functions, 
but it's harder to understand what they ‘really mean’. We will try. But first let's see 
how to use them. 
THE PARTITION FUNCTION KNOWS ALL!


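If a system has partition function

$$ Z(\beta) = \int_X e^{-\beta E(x)} \, dx $$

then in thermal equilibrium at coolness $\beta$ its expected energy is

$$ \langle E \rangle = -\frac{d}{d\beta} \ln Z $$

and its entropy is

$$ S = k\left(\ln Z - \beta \frac{d}{d\beta} \ln Z\right) $$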


Here's how you can compute the expected energy $\langle E \rangle$ and the entropy $S$ of any
system starting from its partition function $Z(\beta)$ as a function of the coolness $\beta$. I'll
show you why these formulas are true, and then we'll test them out on the harmonic
oscillator, where we have already computed the expected energy and entropy by other
methods.
THE PARTITION FUNCTION KNOWS THE EXPECTED
ENERGY 


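If a system has partition function $Z(\beta) = \int_X e^{-\beta E(x)} \, dx$ then

$$ \begin{aligned}
-\frac{d}{d\beta} \ln Z &= -\frac{1}{Z} \frac{d}{d\beta} \int_X e^{-\beta E(x)} \, dx \\
  &= \frac{1}{Z} \int_X E(x) \, e^{-\beta E(x)} \, dx \\
  &= \frac{\int_X E(x) \, e^{-\beta E(x)} \, dx}{\int_X e^{-\beta E(x)} \, dx} \\
  &= \langle E \rangle
\end{aligned} $$

In short, the expected energy is $\langle E \rangle = -\frac{d}{d\beta} \ln Z$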


The partition function is all-powerful! For starters, if you know the partition
function of a physical system, you can figure out its expected energy. The expected energy
$\langle E \rangle$ is minus the derivative of $\ln Z$ with respect to the coolness $\beta = 1/kT$.

How do we show this? Easy: just look at the calculation above! We get a fraction,
which is the expected value of $E$ with respect to the Gibbs distribution.

By the way, this trick of taking the derivative of the logarithm of a function is
famous: it's called a 'logarithmic derivative'. Notice that

$$ \frac{d}{dx} \ln f(x) = \frac{f'(x)}{f(x)}. $$

Thus the logarithmic derivative says how fast a function is growing compared to the
value of the function itself, like the interest rate in compound interest.
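
Here is a quick sanity check of $\langle E \rangle = -\frac{d}{d\beta} \ln Z$. To keep it simple I assume a system with finitely many states, so the integral over $X$ becomes a sum; the list `energies` is made up for illustration, and the derivative is approximated by a central difference.

```python
import math

energies = [0.0, 1.0, 1.5, 4.0]  # hypothetical energies of a 4-state system

def partition_function(beta):
    return sum(math.exp(-beta * E) for E in energies)

def expected_energy(beta):
    """<E> computed directly as a Boltzmann-weighted average."""
    Z = partition_function(beta)
    return sum(E * math.exp(-beta * E) for E in energies) / Z

def minus_dlogZ(beta, step=1e-6):
    """-d(ln Z)/d(beta), estimated by a central difference."""
    upper = math.log(partition_function(beta + step))
    lower = math.log(partition_function(beta - step))
    return -(upper - lower) / (2 * step)

beta = 0.8
print(expected_energy(beta), minus_dlogZ(beta))
```

The two numbers agree up to the small error of the finite-difference approximation.
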
THE PARTITION FUNCTION KNOWS THE ENTROPY


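If a system has Boltzmann distribution

$$ p(x) = \frac{e^{-\beta E(x)}}{Z} \quad \text{where} \quad Z = \int_X e^{-\beta E(x)} \, dx $$

then its entropy in thermal equilibrium is

$$ \begin{aligned}
S &= -k \int_X p(x) \ln p(x) \, dx \\
  &= -k \int_X p(x) \ln\left(\frac{e^{-\beta E(x)}}{Z}\right) dx \\
  &= k \int_X p(x) \left(\ln Z + \beta E(x)\right) dx \\
  &= k\left(\ln Z + \beta \langle E \rangle\right)
\end{aligned} $$

But since $\langle E \rangle = -\frac{d}{d\beta} \ln Z$, this gives

$$ S = k\left(\ln Z - \beta \frac{d}{d\beta} \ln Z\right) $$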


The entropy is a bit more complicated. But don't be scared! The Boltzmann
distribution $p(x)$ is a fraction, so the log of this fraction breaks into two parts:

$$ \ln p(x) = \ln\left(\frac{e^{-\beta E(x)}}{Z}\right) = -(\ln Z + \beta E(x)). $$

Thus our integral for entropy breaks into two parts:

$$ S = -k \int p(x) \ln p(x) \, dx = k \int p(x) \ln Z \, dx + k\beta \int p(x) E(x) \, dx. $$

The first part is just $k \ln Z$ since the integral of $p(x)$ is 1. The second part is $k\beta \langle E \rangle$. If
we use what we just learned about $\langle E \rangle$:

$$ \langle E \rangle = -\frac{d}{d\beta} \ln Z $$

we get this formula for entropy in terms of the partition function:

$$ S = k\left(\ln Z - \beta \frac{d}{d\beta} \ln Z\right). $$

This formula seems hard to understand at first. To extract its inner meaning, we need
a new concept: 'free energy'.
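
To see the two routes to the entropy agree, here is a sketch for a system with finitely many states (so integrals become sums). The energies are invented for the example and the function names are mine. One function computes $-k \sum p \ln p$ directly; the other uses $k(\ln Z + \beta \langle E \rangle)$.

```python
import math

def boltzmann(energies, beta):
    """Return the partition function Z and the Boltzmann probabilities."""
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)
    return Z, [w / Z for w in weights]

def entropy_direct(energies, beta, k=1.0):
    """Gibbs entropy -k sum p ln p of the Boltzmann distribution."""
    _, probs = boltzmann(energies, beta)
    return -k * sum(p * math.log(p) for p in probs)

def entropy_from_Z(energies, beta, k=1.0):
    """The same entropy computed as k (ln Z + beta <E>)."""
    Z, probs = boltzmann(energies, beta)
    mean_energy = sum(p * E for p, E in zip(probs, energies))
    return k * (math.log(Z) + beta * mean_energy)

energies = [0.0, 0.3, 1.0, 2.2]  # hypothetical energies
print(entropy_direct(energies, 1.5), entropy_from_Z(energies, 1.5))
```

The two values agree to within floating-point rounding, as the derivation says they must.
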
THE PARTITION FUNCTION KNOWS THE FREE ENERGY


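To maximize entropy while holding expected energy constant,
you can just minimize the free energy

$$ F = \langle E \rangle - TS $$

We've seen

$$ \langle E \rangle = -\frac{d}{d\beta} \ln Z \quad \text{and} \quad S = k\left(\ln Z - \beta \frac{d}{d\beta} \ln Z\right) $$

so with $\beta = 1/kT$ a little algebra shows

$$ F = -\frac{1}{\beta} \ln Z $$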

We can understand the relation between entropy, energy and the partition function
if we bring in a concept I haven't mentioned yet: the free energy

$$ F = \langle E \rangle - TS. $$

Since we know formulas for $\langle E \rangle$ and $S$ in terms of the partition function, we can work
out a formula for $F$. And it's really simple! Much simpler than $S$, for example. It's
just

$$ F = -\frac{1}{\beta} \ln Z. $$

But what's the meaning of free energy? Remember: to maximize the Shannon
entropy $H$ subject to a constraint on expected energy, we introduced the Lagrange
multiplier $\beta = 1/kT$ and maximized the quantity $H - \beta \langle E \rangle$. But if you multiply this
quantity by $-kT$, you get free energy:

$$ -kT(H - \beta \langle E \rangle) = \langle E \rangle - TS = F. $$

So, as long as $T > 0$, maximizing entropy subject to a constraint on expected energy is
equivalent to minimizing free energy!

Thus, free energy turns a problem of maximizing entropy subject to a constraint 
into a minimization problem without a constraint. The point is not that we've turned 
maximization into minimization: that's just an arbitrary business with signs. The point 
is that free energy lets us stop thinking about the constraint. 

There's a huge amount to say about the free energy, which is also called the
'Helmholtz free energy', since there are other kinds. You can think of $TS$ as the amount
of energy in useless random form, since it comes from entropy. Since $\langle E \rangle$ is the total
expected energy, $F = \langle E \rangle - TS$ is the amount of 'useful' energy. More precisely, the
free energy is the maximum amount of work obtainable from a system at a constant
temperature. But showing this would take us out of our way.
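
Here is a small numerical illustration of the free energy story, assuming a made-up 3-state system in units where $k = 1$. It checks that $\langle E \rangle - TS$ evaluated on the Boltzmann distribution equals $-\frac{1}{\beta} \ln Z$, and that nudging the probabilities away from the Boltzmann distribution raises $\langle E \rangle - TS$. The perturbation is an arbitrary choice of mine, so this shows one example, not a proof.

```python
import math

energies = [0.0, 1.0, 2.5]  # hypothetical energies
k, T = 1.0, 0.9
beta = 1.0 / (k * T)

def free_energy(probs):
    """<E> - T S for an arbitrary probability distribution on the states."""
    mean_energy = sum(p * E for p, E in zip(probs, energies))
    entropy = -k * sum(p * math.log(p) for p in probs if p > 0)
    return mean_energy - T * entropy

Z = sum(math.exp(-beta * E) for E in energies)
boltzmann = [math.exp(-beta * E) / Z for E in energies]

F_boltzmann = free_energy(boltzmann)
F_formula = -math.log(Z) / beta

# Move a little probability out of the lowest state; the total still sums to 1.
perturbed = [boltzmann[0] - 0.05, boltzmann[1] + 0.03, boltzmann[2] + 0.02]
F_perturbed = free_energy(perturbed)
print(F_boltzmann, F_formula, F_perturbed)
```

The first two numbers agree, and the third is larger.
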
THE PARTITION FUNCTION KNOWS ALL: REVISITED
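If $Z(\beta)$ is the partition function of a system, in thermal
equilibrium at coolness $\beta$ its expected energy is

$$ \langle E \rangle = -\frac{d}{d\beta} \ln Z $$

and its free energy is

$$ F = -\frac{1}{\beta} \ln Z $$

We can compute its entropy from these using

$$ F = \langle E \rangle - TS $$

and we get

$$ S = k\left(\ln Z - \beta \frac{d}{d\beta} \ln Z\right) $$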


Now we can tell a simpler story, which is easier to remember. Free energy, being
the energy in useful form, is the expected energy minus the useless energy, which is
temperature times entropy. Thus

$$ F = \langle E \rangle - TS $$

so

$$ S = \frac{\langle E \rangle - F}{T} = k\beta(-F + \langle E \rangle) $$

and using our formulas for $F$ and $\langle E \rangle$ in terms of the partition function $Z$, we get

$$ S = k\left(\ln Z - \beta \frac{d}{d\beta} \ln Z\right). $$

The story here is more of a mnemonic than a true explanation, because I'm not
saying much about what it means for energy to be 'useful' or 'useless'. I've only given this
hint: when a system is in thermal equilibrium, its free energy is minimized. For more
on the meaning of free energy, try a good book on thermodynamics, like this:


• Frederick Reif, Fundamentals of Statistical and Thermal Physics, Waveland Press, 
Long Grove, Illinois, 2009. 


Right now I'd rather say a bit about the meaning of the partition function. 
THE MEANING OF THE PARTITION FUNCTION


Say $X$ is a set where each point $i$ has an 'energy' $E_i \in \mathbb{R}$.
Its partition function is

$$ Z = \sum_{i \in X} e^{-\beta E_i} $$

where $\beta \in \mathbb{R}$ is the coolness.

The partition function counts the points of $X$, but it counts
points with large energy less, since they're less likely to be
'occupied'.

If $\beta = 1/kT$, points with energy $E_i \gg kT$ count for very little.

But as $T \to +\infty$, all points get fully counted and $Z \to |X|$.

In physics we call $Z$ the number of accessible states.


Say we have a system with some countable set of states $X$. In thermal equilibrium
at temperature $T$, the probability that the system is in its $i$th state is proportional to
$\exp(-\beta E_i)$, where $E_i$ is the energy of that state and $\beta$ is the coolness. Thus, physicists
say the partition function

$$ Z = \sum_{i \in X} e^{-\beta E_i} $$

is the number of accessible states: roughly, the number of states the system can easily
be in at temperature $T$, where $\beta = 1/kT$.

This is a funny thing to say, because being 'accessible' is not a yes-or-no matter. A
more precise statement is that the partition function counts states weighted by their
accessibility $\exp(-\beta E_i)$. States whose energy is low compared to $kT$ are highly
accessible, or probable, because $\exp(-\beta E_i)$ is close to 1 if $E_i \ll kT$. States of high energy
are more inaccessible, or improbable, since $\exp(-\beta E_i)$ is close to 0 if $E_i \gg kT$.

Calling the partition function the 'number of accessible states' emphasizes how it
generalizes the cardinality $|X|$ of an ordinary set $X$, meaning its number of points.
Let's make this precise! Let's call a set $X$ with a function $E\colon X \to \mathbb{R}$ an energetic set.
I will write it merely as $X$, so you need to remember it comes with an energy function.
I will call its partition function $Z(X)$:

$$ Z(X) = \sum_{x \in X} e^{-\beta E(x)}. $$

If $X$ is finite we don't have to worry about the convergence of this sum. My main
message is this:

The partition function $Z(X)$ does for energetic sets
what the cardinality $|X|$ does for sets.

For example, just like the cardinality, the partition function adds when you take disjoint
unions, and multiplies when you take products! Let's see why.
Puzzle 35. The disjoint union $X + X'$ of energetic sets $E\colon X \to \mathbb{R}$ and $E'\colon X' \to \mathbb{R}$
is again an energetic set: for points in $X$ we use the energy function $E$, while for
points in $X'$ we use the function $E'$. Show that the partition function obeys the law
$Z(X + X') = Z(X) + Z(X')$, at least for finite energetic sets.

Puzzle 36. The cartesian product $X \times X'$ of energetic sets $E\colon X \to \mathbb{R}$ and $E'\colon X' \to \mathbb{R}$
is again an energetic set: define the energy of $(x,x') \in X \times X'$ to be $E(x) + E'(x')$.
This is how it really works in physics. Show that the partition function obeys the law
$Z(X \times X') = Z(X) \, Z(X')$, at least for finite energetic sets.

Puzzle 37. Show that if $X$ is a finite energetic set, its partition function $Z(X)$
approaches its cardinality $|X|$ as $T \to +\infty$.
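
Before proving these three laws, you may like to watch them hold in an example. This sketch represents a finite energetic set simply as a Python list of energies, one entry per point, which is my own modelling choice; the particular energies are made up. It checks a numerical instance of each puzzle and proves nothing.

```python
import math
from itertools import product

def Z(energies, beta):
    """Partition function of a finite energetic set given as a list of energies."""
    return sum(math.exp(-beta * E) for E in energies)

X = [0.0, 1.0, 3.0]  # hypothetical energetic set with 3 points
Xp = [0.5, 2.0]      # hypothetical energetic set with 2 points

disjoint_union = X + Xp                                   # keep each point's own energy
cartesian_product = [E + Ep for E, Ep in product(X, Xp)]  # energies add on pairs

beta = 0.7
print(Z(disjoint_union, beta), Z(X, beta) + Z(Xp, beta))
print(Z(cartesian_product, beta), Z(X, beta) * Z(Xp, beta))

# High temperature means small beta: Z approaches the number of points.
print(Z(X, 1e-9), len(X))
```

Each printed pair should match, up to rounding in the first two lines and up to a tiny correction of order $\beta$ in the last.
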


The key virtue of cardinality is that two sets are isomorphic (that is, there exists
a one-to-one and onto function between them) if and only if they have the same
cardinality. This generalizes to energetic sets if we use the partition function instead of
the cardinality! Let's say two energetic sets with energy functions $E\colon X \to \mathbb{R}$ and
$E'\colon X' \to \mathbb{R}$ are isomorphic if there is a one-to-one and onto $f\colon X \to X'$ which is
compatible with their energy functions, meaning

$$ E'(f(x)) = E(x) $$

for all $x \in X$.

Puzzle 38. Show that two finite energetic sets are isomorphic if and only if they have
the same partition function. (Hint: the key is to show that the functions $\exp(-E/kT)$
for various energies $E \in \mathbb{R}$ are linearly independent. As a step toward this, show that
a finite linear combination

$$ \sum_i c_i \exp(-E_i/kT) $$

can only be zero if $c_i = 0$ for the smallest energy $E_i$.)


If you're into category theory, here are some ways to go further. If you're not, please 
skip to the next page. 
Puzzle 39. Make up a category of energetic sets, where morphisms are maps that are
compatible with their energy functions. Prove that it is a category.

Puzzle 40. Show the disjoint union of energetic sets is the coproduct in this category.

Puzzle 41. Show that what I called the cartesian product of energetic sets is not the
product in this category.

Puzzle 42. Show that what I called the 'cartesian product' of energetic sets gives a
symmetric monoidal structure on the category of energetic sets. So we should really
write it as a tensor product $X \otimes X'$, not $X \times X'$.

Puzzle 43. Show this tensor product distributes over coproducts: $X \otimes (Y + Z) \cong
X \otimes Y + X \otimes Z$.

We can go even further and define not only a partition function for energetic sets,
but also an expected energy, free energy, and entropy, using the formulas we've seen
earlier. These obey a bunch of rules like this:

Puzzle 44. Define the entropy of an energetic set by

$$ S(X) = k\left(\ln Z(X) - \beta \frac{d}{d\beta} \ln Z(X)\right). $$

Show that

$$ S(X \otimes Y) = S(X) + S(Y). $$



ENTROPY COMES IN TWO PARTS 


The entropy of a system in thermal equilibrium
is always the sum of two parts:

1. The free energy part:

$$ -\frac{F}{T} = k \ln Z $$

This is Boltzmann's constant times the logarithm of the number
of accessible states.

2. The expected energy part:

$$ \frac{\langle E \rangle}{T} $$

This equals $\frac{1}{2} n k$ if the system has $n$ degrees of freedom and its
energy is a positive definite quadratic form.


Before we dive into examples, it's good to think one last time about the entropy of
a system in thermal equilibrium. We've seen that this entropy is always the sum of two
parts, which we could call the free energy part $-F/T$ and the expected energy part
$\langle E \rangle/T$. But there are various ways to think about this. One is simply that it follows
from $F = \langle E \rangle - TS$: the free energy is the expected energy minus the useless energy.
But here is another way to think about it.

In his early work, Boltzmann said the entropy of a system is $k$ times the logarithm of
the number of states it can occupy. This is true if all these states are equally probable.
But typically some states are more probable than others. We could try to address this
by replacing the number of states with the number of accessible states

$$ Z = \sum_{i \in X} e^{-\beta E_i} $$

Here we count states weighted by their accessibility $\exp(-\beta E_i)$. If we try to follow
Boltzmann's prescription with this adjustment we get $k \ln Z = -F/T$. This is the free
energy part of the entropy.

In many situations this is close to the true entropy. But this clearly can't be all
there is to it. After all, suppose we add the same constant $c$ to the energy of each state.
Then the probability of each state in thermal equilibrium is unchanged, so the entropy
must stay the same! But the accessibility of each state gets multiplied by $\exp(-\beta c)$,
so we have to subtract $k\beta c = c/T$ from the free energy part of the entropy. There must be
some compensating term, and this is the expected energy part of the entropy, $\langle E \rangle/T$.
When we add $c$ to the energy of each state, this goes up by $c/T = k\beta c$.

Thus, in thermal equilibrium we can think of entropy as k times the log of the 
number of accessible states, ‘corrected’ so that the result doesn't change when we add 
a constant to the energy of every state. 
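
Here is a small numerical sanity check of this claim, as a Python sketch rather than anything definitive. The three energies, the coolness `beta` and the shift `c` are made-up illustrative numbers, and I set k = 1. The code computes the free energy part k ln Z and the expected energy part ⟨E⟩/T separately, then shifts every energy by c and does it again.

```python
import math

def entropy_parts(energies, beta, k=1.0):
    """Return (free energy part, expected energy part) of the entropy.

    free energy part     = k ln Z   (which equals -F/T)
    expected energy part = <E>/T    with T = 1/(k beta)
    """
    weights = [math.exp(-beta * E) for E in energies]  # accessibilities
    Z = sum(weights)
    expected_E = sum(E * w for E, w in zip(energies, weights)) / Z
    T = 1.0 / (k * beta)
    return k * math.log(Z), expected_E / T

def gibbs_entropy(energies, beta, k=1.0):
    """Entropy computed directly as -k sum p_i ln p_i for the Boltzmann distribution."""
    weights = [math.exp(-beta * E) for E in energies]
    Z = sum(weights)
    return -k * sum((w / Z) * math.log(w / Z) for w in weights)

# Hypothetical three-state system; all numbers are illustrative.
energies = [0.0, 1.0, 2.5]
beta, c = 0.7, 3.0

free0, energy0 = entropy_parts(energies, beta)
free1, energy1 = entropy_parts([E + c for E in energies], beta)

print("free energy part:    ", free0, "->", free1)
print("expected energy part:", energy0, "->", energy1)
print("total entropy:       ", free0 + energy0, "->", free1 + energy1)
```

The free energy part should drop by kβc and the expected energy part should rise by the same amount, so the total agrees with the Gibbs entropy both before and after the shift.
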
THE POWER OF THE PARTITION FUNCTION


A classical harmonic oscillator with mass m and spring constant $\kappa$
has energy

$$E(p, q) = \frac{p^2}{2m} + \frac{\kappa q^2}{2}.$$

Its partition function is

$$Z = \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-\beta E(p,q)} \, \frac{dp \, dq}{h}$$

where $\beta$ is coolness and h is Planck's constant.


From this we can find its expected energy and free energy in
thermal equilibrium:

$$\langle E \rangle = -\frac{d}{d\beta} \ln Z \qquad F = -\frac{1}{\beta} \ln Z$$

and then its entropy:

$$S = \frac{\langle E \rangle - F}{T}$$

where T is temperature: $\beta = 1/kT$ where k is Boltzmann's
constant.


To test the power of the partition function, let's use it to figure out the entropy of 
a classical harmonic oscillator. Here's the game plan. First we'll compute the partition 
function by doing the integral that defines it. Then we'll use it to compute the oscillator's
expected energy and free energy. Then we'll subtract those and divide by temperature 
to get the entropy. 

In fact, we've already worked out the answer to this problem: 


$$S = k\left(\ln\frac{kT}{\hbar\omega} + 1\right).$$


Our earlier approach led to some cool insights. But it was ‘tricky’, not systematic. The 
partition function method is systematic, so it’s good for harder problems. It will also 
give new insight into that pesky +1. 

When we compute the entropy using a partition function, all the pain is concentrated 
at one point: computing the partition function! So let’s get that over with. 




HARMONIC OSCILLATOR: PARTITION FUNCTION 


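A classical harmonic oscillator has energy $E(p,q) = \frac{p^2}{2m} + \frac{\kappa q^2}{2}$
and frequency $\omega = \sqrt{\kappa/m}$, so its partition function is

$$
\begin{aligned}
Z &= \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-\beta\left(\frac{p^2}{2m} + \frac{\kappa q^2}{2}\right)} \, \frac{dp \, dq}{h} \\
  &= \frac{1}{h\omega} \int_{-\infty}^\infty \int_{-\infty}^\infty e^{-\beta (x^2 + y^2)/2} \, dx \, dy \qquad \left(x = \frac{p}{\sqrt{m}}, \; y = \sqrt{\kappa}\, q\right) \\
  &= \frac{1}{h\omega} \int_0^{2\pi} \int_0^\infty e^{-\beta r^2/2} \, r \, dr \, d\theta \qquad \text{(switching to polar)} \\
  &= \frac{2\pi}{h\omega} \int_0^\infty e^{-\beta u} \, du \qquad (u = r^2/2) \\
  &= \frac{1}{\beta \hbar \omega}
\end{aligned}
$$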


For the harmonic oscillator, the partition function is the integral of a Gaussian in 
two variables. A change of variables makes the Gaussian ‘round’, and then we use polar 
coordinates to do the integral. 

The physicist Kelvin is said to have written 


$$\int_{-\infty}^\infty e^{-x^2} \, dx = \sqrt{\pi}$$


on the blackboard and said “A mathematician is one to whom that is as obvious as 
that twice two makes four is to you.” I find that rather obnoxious, but when I heard 
the story as a kid, I made damn sure I knew how to do this integral. The usual trick is 
to compute the square of this integral using polar coordinates. 

Now we're seeing something interesting. The harmonic oscillator, whose energy 
depends quadratically on two degrees of freedom, is physically more important than a 
system whose energy depends quadratically on just one degree of freedom. And when 
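$\beta = h = \omega = 1$, the partition function of the harmonic oscillator is

$$\int_{-\infty}^\infty \int_{-\infty}^\infty e^{-(x^2+y^2)/2} \, dx \, dy = \int_0^{2\pi} \int_0^\infty e^{-r^2/2} \, r \, dr \, d\theta = 2\pi \int_0^\infty e^{-u} \, du = 2\pi,$$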


which is more fundamental than the integral Kelvin wrote down. 
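
If you'd like to see this integral come out numerically, here is a rough Python sketch. It follows the polar-coordinate step: the angular integral gives 2π, and the radial integral is estimated with a plain midpoint rule. The cutoff `r_max` and the number of steps `n` are arbitrary choices of mine, not anything from the derivation.

```python
import math

def radial_integral(beta=1.0, r_max=12.0, n=200_000):
    """Midpoint-rule estimate of the integral of exp(-beta r^2 / 2) r dr from 0 to r_max."""
    dr = r_max / n
    total = 0.0
    for i in range(n):
        r = (i + 0.5) * dr
        total += math.exp(-beta * r * r / 2) * r
    return total * dr

def oscillator_Z(beta, h, omega):
    """Z = (2 pi / (h omega)) * radial integral, following the polar-coordinate step."""
    return 2 * math.pi / (h * omega) * radial_integral(beta)

Z_unit = oscillator_Z(beta=1.0, h=1.0, omega=1.0)  # should be close to 2 pi
Z_cool = oscillator_Z(beta=2.0, h=1.0, omega=1.0)  # should be close to 2 pi / 2 = pi
print(Z_unit, 2 * math.pi)
print(Z_cool, math.pi)
```

Doubling the coolness halves the partition function, as expected from Z = 2π/βhω.
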
HARMONIC OSCILLATOR: EXPECTED ENERGY


A classical harmonic oscillator has partition function

$$Z = \frac{1}{\beta \hbar \omega}$$

so its expected energy in thermal equilibrium is

$$\langle E \rangle = -\frac{d}{d\beta} \ln Z = \frac{1}{\beta}$$

or in other words

$$\langle E \rangle = kT$$


just as the equipartition theorem says it must be! 


Once we know the partition function of the classical harmonic oscillator, it's easy
to compute its expected energy: just use

$$\langle E \rangle = -\frac{d}{d\beta} \ln Z$$

and get

$$\langle E \rangle = \frac{1}{\beta} = kT.$$

We can also figure this out using the equipartition theorem. Remember, the equipartition
theorem applies to a classical system whose energy is quadratic. If it has n degrees
of freedom, then at temperature T it has

$$\langle E \rangle = \frac{n}{2} kT.$$

Our harmonic oscillator has n = 2, so we get $\langle E \rangle = kT$. Good, this matches the
partition function approach!
HARMONIC OSCILLATOR: FREE ENERGY


A classical harmonic oscillator has partition function

$$Z(\beta) = \frac{1}{\beta \hbar \omega}$$

so its free energy in thermal equilibrium is

$$F = -\frac{1}{\beta} \ln Z = -\frac{1}{\beta} \ln \frac{1}{\beta \hbar \omega}$$


The partition function lets us do more! It lets us compute the free energy, too, using 


$$F = -\frac{1}{\beta} \ln Z.$$

Unlike the expected energy, the free energy involves Planck's constant:

$$F = -kT \ln\left(\frac{kT}{\hbar\omega}\right).$$

Note kT and $\hbar\omega$ both have units of energy, so $kT/\hbar\omega$ is dimensionless, which is good
because we're taking its logarithm. Also note that the free energy is negative at high 
temperatures! That may seem weird, but it turns out to be good when we compute the 
entropy. 
HARMONIC OSCILLATOR: ENTROPY


In thermal equilibrium at temperature T,
a classical harmonic oscillator has
expected energy $\langle E \rangle = kT$ and free energy $F = -kT \ln\left(\frac{kT}{\hbar\omega}\right)$

so its entropy is

$$S = k\left(\ln\left(\frac{kT}{\hbar\omega}\right) + 1\right)$$


To compute the entropy of a classical harmonic oscillator, we just use

$$S = \frac{\langle E \rangle - F}{T}.$$

We get the answer we got before, of course:

$$S = k\left(\ln\left(\frac{kT}{\hbar\omega}\right) + 1\right).$$


But now we can finally understand the puzzling extra +1. 
As we've seen, the entropy of any system in thermal equilibrium consists of two 
parts: 


1. The free energy part, $-F/T$. For the classical harmonic oscillator this is

$$-\frac{F}{T} = k \ln\left(\frac{kT}{\hbar\omega}\right).$$

2. The expected energy part, $\langle E \rangle/T$. For the classical harmonic oscillator this is

$$\frac{\langle E \rangle}{T} = k.$$

The free energy part of the entropy is always k times the logarithm of the number
of accessible states. For the classical harmonic oscillator, the expected energy part of
the entropy must equal k by the equipartition theorem, since the oscillator's energy
depends quadratically on 2 degrees of freedom. This is small compared to the free energy part when
$\hbar\omega \ll kT$: that is, when quantum effects are small compared to thermal effects.
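
To see the whole recipe run end to end, here is a minimal Python sketch that starts from ln Z alone and produces the expected energy, free energy and entropy. The values of T and ħω are arbitrary illustrative choices, I set k = 1, and the derivative with respect to β is only approximated by a finite difference, so treat the numbers as a check rather than a derivation.

```python
import math

def ln_Z(beta, hbar_omega):
    """ln of the classical oscillator's partition function Z = 1/(beta hbar omega)."""
    return -math.log(beta * hbar_omega)

def oscillator_thermo(T, hbar_omega, k=1.0, rel_step=1e-6):
    """Return (<E>, F, S) computed from ln Z alone."""
    beta = 1.0 / (k * T)
    d = beta * rel_step
    # <E> = -d(ln Z)/d(beta), estimated by a central finite difference.
    expected_E = -(ln_Z(beta + d, hbar_omega) - ln_Z(beta - d, hbar_omega)) / (2 * d)
    F = -ln_Z(beta, hbar_omega) / beta
    S = (expected_E - F) / T
    return expected_E, F, S

# Illustrative values in units where k = 1.
T, hbar_omega = 2.0, 0.5
expected_E, F, S = oscillator_thermo(T, hbar_omega)
closed_form = math.log(T / hbar_omega) + 1.0   # k (ln(kT / hbar omega) + 1) with k = 1
print(expected_E, F, S, closed_form)
```

With kT four times ħω, the free energy comes out negative and the entropy matches the closed formula, including the +1.
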
PARTICLE IN A BOX: PARTITION FUNCTION


The energy of a classical free particle of mass m
in a 1-dimensional box depends only on its momentum p:

$$E(p, q) = \frac{p^2}{2m}.$$

Its position q is trapped in the interval [0, L].


Its partition function is therefore

$$Z(\beta) = \int_0^L \int_{-\infty}^\infty e^{-\beta E(p,q)} \, \frac{dp \, dq}{h} = \frac{L}{h} \int_{-\infty}^\infty e^{-\beta p^2/2m} \, dp = \frac{L}{h} \sqrt{\frac{2\pi m}{\beta}}$$


Now let's turn to our ultimate goal: computing the entropy of a box of gas. As a 
warmup, let's figure out the entropy of a single particle in a box. In fact, let's start 
with a free classical particle in a one-dimensional box: that is, in some interval [0, L].

The first step is to compute its partition function. As you can see, this is easy 
enough. But the whole idea raises some questions. Some people get freaked out by the 
concept of entropy for a single particle—I guess because it involves probability theory 
for a single particle, and they think probability only applies to large numbers of things. 

I sometimes ask these people “how large counts as large?” In fact the foundations
of probability theory are just as mysterious for large numbers of things as for just one
thing. What do probabilities really mean? We could argue about this all day: Bayesian
versus frequentist interpretations of probability, etc. I said a tiny bit about this before,
and I won't say more now. 

Large numbers of things tend to make large deviations less likely. For example, the
chance of finding all the gas atoms in a box on the left side is smaller if you have 1000
atoms than if you have just 2. This makes us worry less about using averages and 
probability. 

But the math of probability works the same for small numbers of particles—even 
one particle! Even better, knowing the entropy of one particle in a box will help us 
understand the entropy of a million particles in a box—at least if they don't interact, 
as we assume for an ‘ideal gas’. 

But why just a one-dimensional box? The answer is that a particle in a 3-dimensional
box is mathematically the same as 3 noninteracting distinguishable particles in a
one-dimensional box! The x, y, and z coordinates of the 3d particle act like positions of
three 1d particles. 
PARTICLE IN A BOX: EXPECTED ENERGY


A classical free particle of mass m in a 1d box of length L
has partition function

$$Z = \frac{L}{h} \sqrt{\frac{2\pi m}{\beta}}$$

The expected energy of any system in
thermal equilibrium is

$$\langle E \rangle = -\frac{d}{d\beta} \ln Z$$

So, by the miracle of basic calculus, we get

$$\langle E \rangle = \frac{1}{2\beta} = \frac{1}{2} kT$$


as we'd expect from the equipartition theorem! 


We worked out the partition function of a classical free particle in a 1-dimensional 
box. From this we can work out its expected energy. Look how simple it is! It’s just 
$\frac{1}{2}kT$, where k is Boltzmann’s constant and T is the temperature!

Why is the final answer so simple? We can use the chain rule

$$\frac{d}{d\beta} \ln Z = \frac{1}{Z} \frac{dZ}{d\beta}$$

to see that only the power of $\beta$ in

$$Z = \frac{L}{h} \sqrt{\frac{2\pi m}{\beta}}$$

matters, not all the constants: these constants show up in $dZ/d\beta$, but also in $1/Z$, and
they cancel. The length L, the mass m, Planck's constant h, the factor of $2\pi$... none
of this junk matters! Not for the expected energy, anyway. Because Z is proportional
to $\beta^{-1/2}$, we simply get $\langle E \rangle = \frac{1}{2}kT$.

More generally, if the partition function of a system is proportional to $\beta^{-c}$, its
expected energy will be $ckT$:

$$Z \propto \beta^{-c} \implies \langle E \rangle = ckT.$$

But when is the partition function of a system proportional to $\beta^{-c}$? It's enough for
the system's energy to be a positive definite quadratic form in n real variables, which
physicists call ‘degrees of freedom’. Then c = n/2. We've already seen an example with
2 degrees of freedom: the classical harmonic oscillator. We saw that in this example
$Z \propto 1/\beta$. This gives $\langle E \rangle = kT$. But the result is quite general:
Puzzle 45. Suppose we have a system with state space $\mathbb{R}^n$ and energy function $E \colon \mathbb{R}^n \to \mathbb{R}$
that is a positive definite quadratic form, so that

$$E(x) = \frac{1}{2} |Ax|^2$$

for some invertible n × n matrix A. Show that its partition function is proportional to
$\beta^{-c}$ where c = n/2.


In fact, this is just a new outlook on our friend the equipartition theorem. 
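
Here is a quick numerical illustration of the rule that a partition function proportional to β^(-c) gives expected energy ckT, whatever the constant of proportionality. It is only a sketch: the prefactors and the value of β are arbitrary choices of mine, k is set to 1, and the derivative of ln Z is approximated by a finite difference.

```python
import math

def expected_energy(ln_Z, beta, rel_step=1e-6):
    """Estimate <E> = -d(ln Z)/d(beta) with a central finite difference."""
    d = beta * rel_step
    return -(ln_Z(beta + d) - ln_Z(beta - d)) / (2 * d)

def power_law_ln_Z(prefactor, c):
    """ln Z for a partition function of the form Z = prefactor * beta**(-c)."""
    return lambda beta: math.log(prefactor) - c * math.log(beta)

beta = 2.0  # so kT = 1/beta = 0.5 in units with k = 1
results = {}
for prefactor in (1.0, 1e6):        # stands in for all the 'junk' like L, m, h, 2 pi
    for c in (0.5, 1.0, 1.5):       # c = n/2 for n = 1, 2, 3 degrees of freedom
        results[(prefactor, c)] = expected_energy(power_law_ln_Z(prefactor, c), beta)

for (prefactor, c), E in sorted(results.items()):
    print(f"prefactor={prefactor:g}  c={c}  <E>={E:.6f}  c*kT={c / beta:.6f}")
```

Changing the prefactor by a factor of a million should leave every expected energy unchanged, up to the small error of the finite difference.
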

Here's another thing to consider. While our particle in a 1d box has 2 degrees
of freedom, position and momentum, its energy depends on just one of these, and
quadratically on that one. So its expected energy is $\frac{1}{2}nkT$ where n = 1, not n = 2.

So here's another puzzle for you:


Puzzle 46. Say we have a harmonic oscillator with spring constant $\kappa$. As long as $\kappa > 0$,
the energy depends quadratically on 2 degrees of freedom so $\langle E \rangle = kT$. But when
$\kappa = 0$ it depends on just one, and suddenly $\langle E \rangle = \frac{1}{2}kT$. How is such a discontinuity
possible? In other words: how can a particle care so much about the difference between
an arbitrarily small positive spring constant and a spring constant that's exactly zero,
making its expected energy twice as much in the first case?


I'll warn you: this puzzle is deliberately devilish. In a way it’s a trick question!
PARTICLE IN A BOX: FREE ENERGY


A classical free particle of mass m in a 1d box of length L
has partition function

$$Z = \frac{L}{h} \sqrt{\frac{2\pi m}{\beta}}$$

The free energy of any system is given by $F = -\frac{1}{\beta} \ln Z$, so

$$F = -\frac{1}{\beta} \ln\left(\frac{L}{h} \sqrt{\frac{2\pi m}{\beta}}\right)$$

Using $\beta = 1/kT$ and fiddling around a bit,
we can rewrite this as

$$F = -kT \left(\ln L + \frac{1}{2} \ln kT + \frac{1}{2} \ln \frac{2\pi m}{h^2}\right)$$


From the partition function of a classical free particle in a one-dimensional box we 
can also compute its free energy! 
PARTICLE IN A BOX: ENTROPY


We've shown that in thermal equilibrium, a classical particle of
mass m in a 1-dimensional box of length L has expected energy

$$\langle E \rangle = \frac{1}{2} kT$$

and free energy

$$F = -kT \left(\ln L + \frac{1}{2} \ln kT + \frac{1}{2} \ln \frac{2\pi m}{h^2}\right)$$

But entropy S is always $(\langle E \rangle - F)/T$, so

$$S = k \left(\ln L + \frac{1}{2} \ln kT + \frac{1}{2} \ln \frac{2\pi m}{h^2} + \frac{1}{2}\right)$$


Having worked out the expected energy $\langle E \rangle$ and free energy F for a single classical
particle in thermal equilibrium in a 1-dimensional box, it is easy to work out its entropy.
We just subtract the free energy from the expected energy and divide by temperature:

$$S = \frac{\langle E \rangle - F}{T}.$$

The formula we get is not very snappy:

$$S = k \left(\ln L + \frac{1}{2} \ln kT + \frac{1}{2} \ln \frac{2\pi m}{h^2} + \frac{1}{2}\right).$$


We will get a better formula later, and ponder its meaning. For now, let's just make 
these observations: 


• When we make the length L of the box larger, the entropy becomes larger. 
• When we increase the temperature T, the entropy becomes larger.
• When we increase the mass m of the particle, the entropy becomes larger. 


The first two facts should feel intuitively obvious. When we increase the box's 
length, there is more unknown information about the position of the particle in thermal 
equilibrium. When we increase the particle's temperature, there is more unknown 
information about its momentum. The third fact is less obvious. When we introduce the 
concept of ‘thermal wavelength’, we will see that increasing the particle's mass decreases 
its thermal wavelength, which in turn increases its entropy in thermal equilibrium. 
WHERE ARE WE NOW?


The mystery: why does each molecule of hydrogen have ~ 23 bits
of entropy at standard temperature and pressure?


The goal: derive and understand the formula for the entropy of a
classical ideal monatomic gas:

$$S = kN \left(\ln \frac{V}{N} + \frac{3}{2} \ln kT + \gamma\right)$$

including the mysterious constant $\gamma$.


The subgoal: compute the entropy of a single classical
particle in a 1-dimensional box:

$$S = k \left(\ln L + \frac{1}{2} \ln kT + \frac{1}{2} \ln \frac{2\pi m}{h^2} + \frac{1}{2}\right)$$


The sub-subgoal: explain entropy from the ground up, and
compute the entropy of a classical harmonic oscillator:

$$S = k \left(\ln \frac{kT}{\hbar\omega} + 1\right)$$
hw 


Let’s pause to remember where we are in our game plan. First we computed the 
entropy of a classical harmonic oscillator. Now we’ve computed the entropy of a single 
classical particle in a 1-dimensional box. The answer looks a bit like the entropy of an 
ideal gas! That’s no coincidence—we’re almost there now. 

In case you wanted to know the entropy of a particle in a 3-dimensional box, don’t
worry. It’s the same as the entropy of three particles of the same mass in three
1-dimensional boxes of appropriate lengths: the length L, width W and height H of our 3d
box. So we can just sum those 3 entropies and get our answer. Since $\ln L + \ln W + \ln H = \ln V$
where V is the volume of our 3d box, we get

$$S = k \left(\ln V + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{3}{2}\right).$$

Later we'll do this calculation more rigorously and more generally for a box of any 
shape. 

But you may have another question: what's the meaning of our formula for the 
entropy of a classical particle in a 1-dimensional box? It's pretty complicated, after all, 
and we'll need to understand it to have any chance of understanding the mysterious 
constant $\gamma$ in the formula for a classical ideal monatomic gas.

We can understand our formula better if we delve into a tiny bit of quantum me- 
chanics, and the concept of ‘thermal wavelength’. So let's do that. 
THE WAVELENGTH OF A PARTICLE


In quantum mechanics particles are waves!
A particle with momentum p has wavelength

$$\lambda = \frac{h}{p}$$

where h is the unreduced Planck’s constant, exactly

$$6.62607015 \times 10^{-34} \text{ joule-seconds}$$


For example, the wavelength of an electron moving at 
1 meter/second is about 0.7 millimeters. 


One of the most amazing discoveries of 20th-century physics: particles are waves. 
The wavelength of a particle is Planck's constant divided by its momentum! This was 
first realized by Louis de Broglie in his 1924 Ph.D. thesis, so it’s called the ‘de Broglie 
wavelength’. 
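
As a quick check of the electron example quoted in the box above, here is a tiny Python sketch. The value of Planck's constant is the one stated in the text; the electron mass is a standard value that I am supplying as an assumption, since the text does not give it, and the formula is the nonrelativistic p = mv.

```python
H = 6.62607015e-34             # Planck's constant in joule-seconds (as stated in the text)
ELECTRON_MASS = 9.1093837e-31  # electron mass in kg (assumed value, not from the text)

def de_broglie_wavelength(mass, speed):
    """Wavelength h/p of a nonrelativistic particle with momentum p = mass * speed."""
    return H / (mass * speed)

wavelength = de_broglie_wavelength(ELECTRON_MASS, 1.0)  # electron moving at 1 meter/second
print(f"{wavelength * 1000:.2f} mm")
```

The result should land close to the 0.7 millimeters mentioned in the box.
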

Why am I telling you this? Because I want to explain and simplify the formula 
for the entropy of a particle in a box. Even though I derived it classically, it contains 
Planck’s constant! So, it will become more intuitive if we think a tiny bit about quantum 
mechanics. 

A good explanation of quantum mechanics would require a whole other course. But 
it’s good to know that in quantum mechanics, a particle with a given momentum has 
a wavelength associated to it: we shouldn’t imagine it as having a definite location; it’s 
a bit ‘blurry’. 

This will give a more intuitive explanation for our complicated formula of the entropy 
of a particle in a 1d box. We'll use this intuition to simplify our formula. That will 
make it easier to generalize to N particles in a 3d box—that is, a classical monatomic 
ideal gas! 
THE WAVELENGTH OF A WARM PARTICLE


In thermal equilibrium, the average energy of
a classical free particle in 3d space is

$$\langle E \rangle = \frac{3}{2} kT$$

where T is the temperature and k is Boltzmann’s constant.


If the particle has mass m,

$$E = \frac{1}{2} m v^2, \quad p = mv \quad \implies \quad p = \sqrt{2mE} = \sqrt{3mkT}$$

In quantum mechanics, a particle of momentum p has
wavelength $\lambda = h/p$ where h is the unreduced Planck’s constant.


So, at temperature T, the typical wavelength of
a free particle of mass m is roughly

$$\lambda \approx \frac{h}{\sqrt{3mkT}}$$

Particles are waves! Their wavelength is shorter when their momentum is bigger. 
And the warmer they are, the bigger their momentum tends to be. So there should 
be a formula for the typical wavelength of a warm particle. And here it is! It helps 
us visualize the world: particles are a bit blurry, with a characteristic wavelength that 
depends on temperature. 

We get this formula from a blend of ideas. Classical mechanics says kinetic energy
is $E = p^2/2m$. Classical statistical mechanics says $\langle E \rangle = \frac{3}{2}kT$. Quantum mechanics
says $\lambda = h/p$. It’s pretty optimistic to put these formulas together and see what we
get. But the result is approximately correct, though subject to limitations.

We derived $\langle E \rangle = \frac{3}{2}kT$ using classical statistical mechanics. But it's close to correct
for a single quantum particle in a big enough box at high enough temperatures.
Otherwise quantum effects kick in.

Another problem is that $\langle E \rangle = \frac{3}{2}kT$ and $E = p^2/2m$ do not imply $\langle p \rangle = \sqrt{3mkT}$,
even if p here means the magnitude of the momentum vector. The arithmetic mean
of a square is not the square of the arithmetic mean! Really the ‘root mean square’
of p is $\sqrt{3mkT}$. Similarly, even if the root mean square of p is $\sqrt{3mkT}$ and quantum
mechanically $\lambda = h/p$, we cannot conclude that the root mean square of $\lambda$ is $h/\sqrt{3mkT}$.
Again, you cannot pass a root mean square through a reciprocal!

So, our derivation above is dodgy, but it’s okay as an order-of-magnitude approximation
for a warm enough particle in a big enough box.
THE PARTITION FUNCTION AND THE THERMAL
WAVELENGTH


The partition function of a classical free particle
of mass m in a 1d box of length L is

$$Z = \int_0^L \int_{-\infty}^\infty e^{-\beta p^2/2m} \, \frac{dp \, dq}{h} = \frac{L}{h} \sqrt{\frac{2\pi m}{\beta}} = \frac{L}{\Lambda}$$

where

$$\Lambda = h \sqrt{\frac{\beta}{2\pi m}}$$

is called the ‘thermal wavelength’.


Last time we saw that at temperature T, the typical wavelength of a free particle
of mass m is roughly

$$\lambda \approx \frac{h}{\sqrt{3mkT}} = h \sqrt{\frac{\beta}{3m}}.$$

But the partition function of a classical particle of mass m in a box simplifies a lot if we
introduce a slightly different distance scale, which people call the thermal wavelength

$$\Lambda = \frac{h}{\sqrt{2\pi mkT}} = h \sqrt{\frac{\beta}{2\pi m}}.$$

Then the partition function is just the length of the box divided by $\Lambda$. The thermal
wavelength $\Lambda$ is a bit smaller than $\lambda$: we have $\Lambda \approx 0.69 \lambda$. But we probably shouldn't
worry about this too much, since our calculation of $\lambda$ was so rough. Of course all these
details are worth thinking about. But the thermal wavelength will turn out to be very
useful! 
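
To get a feel for the sizes involved, here is a short Python sketch comparing the rough 'typical wavelength' h/√(3mkT) with the thermal wavelength h/√(2πmkT). The choice of a hydrogen molecule at 300 kelvin is just an illustration, and the molecular mass and atomic mass unit below are standard values I am assuming, not numbers taken from the text.

```python
import math

H = 6.62607015e-34                 # Planck's constant, joule-seconds
K = 1.380649e-23                   # Boltzmann's constant, joules per kelvin
ATOMIC_MASS_UNIT = 1.66053907e-27  # kg (assumed value)

def typical_wavelength(m, T):
    """Rough estimate h / sqrt(3 m k T) from the root mean square momentum."""
    return H / math.sqrt(3 * m * K * T)

def thermal_wavelength(m, T):
    """Thermal wavelength h / sqrt(2 pi m k T)."""
    return H / math.sqrt(2 * math.pi * m * K * T)

m_H2 = 2 * 1.008 * ATOMIC_MASS_UNIT  # rough mass of a hydrogen molecule (assumption)
T = 300.0                            # roughly room temperature, in kelvin

lam = typical_wavelength(m_H2, T)
Lam = thermal_wavelength(m_H2, T)
ratio = Lam / lam                    # equals sqrt(3 / (2 pi)), independent of m and T
print(f"typical wavelength: {lam:.3e} m")
print(f"thermal wavelength: {Lam:.3e} m")
print(f"ratio: {ratio:.3f}")
```

Both lengths come out somewhat under an ångström for this example, and their ratio is the same for every mass and temperature.
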
FREE ENERGY AND THE THERMAL WAVELENGTH


In thermal equilibrium, a classical free particle
of mass m in a 1d box of length L has free energy

$$F = -kT \ln\left(\frac{L}{h} \sqrt{\frac{2\pi m}{\beta}}\right) = -kT \ln \frac{L}{\Lambda}$$

where

$$\Lambda = h \sqrt{\frac{\beta}{2\pi m}}$$

is the thermal wavelength.


Since the partition function of the classical free particle in a one-dimensional box is
$Z = L/\Lambda$, and

$$F = -\frac{1}{\beta} \ln Z,$$

we have

$$F = -\frac{1}{\beta} \ln \frac{L}{\Lambda}.$$

Expressing this in terms of temperature rather than coolness, we have

$$F = -kT \ln \frac{L}{\Lambda}.$$
ENTROPY AND THE THERMAL WAVELENGTH


In thermal equilibrium, a classical free particle
of mass m in a 1d box of length L has expected energy

$$\langle E \rangle = \frac{1}{2} kT$$

and free energy

$$F = -kT \ln \frac{L}{\Lambda}$$

where $\Lambda = h/\sqrt{2\pi mkT}$ is the thermal wavelength.


But entropy S is $(\langle E \rangle - F)/T$, so

$$S = k \left(\ln \frac{L}{\Lambda} + \frac{1}{2}\right)$$


Now that we have clean formulas for the expected energy and free energy of the
classical free particle in a 1-dimensional box, we can get a nice formula for its entropy.
This is equivalent to the formula we saw before, but it's easier to understand. It's a
sum of two terms:

$$S = k \left(\ln \frac{L}{\Lambda} + \frac{1}{2}\right).$$

Let's make sure we understand this! We've seen that for any system in thermal
equilibrium, the entropy is the sum of two parts:


1. The free energy part. For the classical particle in a 1-dimensional box, this is

$$-\frac{F}{T} = k \ln \frac{L}{\Lambda}.$$

2. The expected energy part. For the classical particle in a 1-dimensional box, this
is

$$\frac{\langle E \rangle}{T} = \frac{1}{2} k.$$

The free energy part is always k times the logarithm of the number of accessible states,
and for the particle in a one-dimensional box the number of accessible states is $L/\Lambda$.
The expected energy part is $\frac{1}{2}k$, by the equipartition theorem, because the particle's
energy depends quadratically on one degree of freedom.

Let us think a bit more about why the number of accessible states is $L/\Lambda$. The
most rigorous approach is simply to compute the number of accessible states, that is,
the partition function:

$$Z = \int_0^L \int_{-\infty}^\infty e^{-E/kT} \, \frac{dp \, dq}{h} = \int_0^L \int_{-\infty}^\infty e^{-\beta p^2/2m} \, \frac{dp \, dq}{h} = \frac{L}{\Lambda}.$$


A more hand-wavy approach is to imagine the space of states of the particle, meaning
the space of position-momentum pairs $(q, p) \in [0, L] \times \mathbb{R}$. When it comes to counting
accessible states, each region of area h holds one state. The ‘accessible’ states are those
where the energy is not too big compared to kT, so the probability density $e^{-E/kT}$ is
fairly large. This is a bit vague, as it must be, because ‘accessibility’ is not really a
yes-or-no matter. But let's just pretend it is, and demand $E \le kT$. Then the ‘accessible’
region of state space is where $p^2/2m \le kT$, or

$$|p| \le \sqrt{2m/\beta}.$$

This region is

$$\left\{(q, p) \;\middle|\; 0 \le q \le L, \; -\sqrt{2m/\beta} \le p \le \sqrt{2m/\beta}\right\} \subseteq [0, L] \times \mathbb{R}.$$

It has area $L \times 2\sqrt{2m/\beta}$, so the number of states it holds is this divided by h, or

$$\frac{2L}{h} \sqrt{\frac{2m}{\beta}} = \frac{2}{\sqrt{\pi}} \frac{L}{\Lambda}.$$

This is just 13% more than the exact value of Z. More importantly, I hope this calculation
gives you a mental picture of the number of accessible states for the particle in
a one-dimensional box. A mental picture can be helpful even if it’s oversimplified. I
like to imagine counting the little rectangles of area h that can fit into the ‘accessible’
region of state space. 
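
The 13% figure is easy to confirm numerically. In this Python sketch the box length, mass, coolness and the value used for h are arbitrary illustrative numbers of mine; the point is only that the ratio of the rectangle count to the exact partition function does not depend on them.

```python
import math

def exact_Z(L, m, beta, h):
    """Exact partition function (L/h) sqrt(2 pi m / beta) of the particle in a 1d box."""
    return (L / h) * math.sqrt(2 * math.pi * m / beta)

def rectangle_count(L, m, beta, h):
    """Number of cells of area h in the region 0 <= q <= L, |p| <= sqrt(2m/beta)."""
    p_max = math.sqrt(2 * m / beta)
    area = L * 2 * p_max
    return area / h

# Illustrative values in arbitrary units.
L, m, beta, h = 5.0, 2.0, 0.3, 0.01
ratio = rectangle_count(L, m, beta, h) / exact_Z(L, m, beta, h)
print(f"crude count / exact Z = {ratio:.4f}")
```

The ratio should equal 2/√π, which is about 1.13.
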

In fact this idea is related to Bohr and Sommerfeld’s early approach to quantum 
physics, the ‘old quantum theory’, which was later subsumed by the theory of ‘geometric 
quantization’. In Bohr and Sommerfeld’s approach, when we quantize a classical system 
with one position and one momentum degree of freedom, there should be approximately 
one quantum state for each region of area h in the qp plane. More generally, when we
quantize a classical system with n position and n momentum degrees of freedom, there
should be approximately one quantum state for each region $R \subseteq \mathbb{R}^{2n}$ with

$$\int_R \frac{d^n p \, d^n q}{h^n} = 1.$$


So, delving into more quantum mechanics, and geometric quantization, would shed 
more light on the calculations we’re doing now. 


PARTICLE IN A 3D BOX: PARTITION FUNCTION


The partition function of a classical free particle of mass m
in a 3d box B of volume V is

$$Z = \int_B \int_{\mathbb{R}^3} e^{-\beta p \cdot p/2m} \, \frac{d^3p \, d^3q}{h^3} = \frac{V}{h^3} \left(\frac{2\pi m}{\beta}\right)^{3/2}$$

where $\beta = 1/kT$ is the coolness.


This result becomes prettier using the thermal wavelength

$$\Lambda = h (\beta/2\pi m)^{1/2}$$

Then we get simply

$$Z = \frac{V}{\Lambda^3}$$


Now that we’ve worked out the statistical mechanics of a classical particle in a one- 
dimensional box, it’s easy to copy everything for a three-dimensional box of any shape. 
We start with the partition function. The energy of a free particle of mass m is $p \cdot p/2m$,
so the partition function is the integral of $\exp(-\beta p \cdot p/2m)$ over all possible positions
and momenta. Integrate over momentum and you get

$$\int_{\mathbb{R}^3} e^{-\beta(p_1^2 + p_2^2 + p_3^2)/2m} \, \frac{dp_1 \, dp_2 \, dp_3}{h^3} = \left(\int_{-\infty}^\infty e^{-\beta p^2/2m} \, \frac{dp}{h}\right)^3 = \left(\frac{1}{h} \sqrt{\frac{2\pi m}{\beta}}\right)^3.$$

In terms of the thermal wavelength this is just $1/\Lambda^3$. Integrate over position and you
multiply this by the volume of the box, say V. So we get an incredibly simple final
answer:

$$Z = \frac{V}{\Lambda^3}.$$

And this sort of calculation works in any dimension: there's nothing special about the 
number 3 here. 
PARTICLE IN A 3D BOX: ENTROPY


In thermal equilibrium, a classical free particle of mass m
in a 3d box of volume V has expected energy

$$\langle E \rangle = \frac{3}{2} kT$$

and free energy

$$F = -kT \ln \frac{V}{\Lambda^3}$$

where $\Lambda = h/\sqrt{2\pi mkT}$ is the thermal wavelength.


But entropy S is $(\langle E \rangle - F)/T$, so

$$S = k \left(\ln \frac{V}{\Lambda^3} + \frac{3}{2}\right)$$


The entropy of a particle in thermal equilibrium in a three-dimensional box works 
very much like our earlier calculation for a one-dimensional box, with a couple of ad- 
justments due to the dimension. Since the particle's energy is now a quadratic function 
of 3 variables, the equipartition theorem now says its expected energy is 


$$\langle E \rangle = \frac{3}{2} kT.$$

We can work out its free energy from its partition function, which we computed in the
last tweet:

$$F = -kT \ln Z = -kT \ln \frac{V}{\Lambda^3}.$$

Thus its entropy is

$$S = \frac{\langle E \rangle - F}{T} = k \left(\ln \frac{V}{\Lambda^3} + \frac{3}{2}\right).$$

The meaning of the two terms here is very similar to that for the particle in the
one-dimensional box. The first term is k times the logarithm of the number of accessible
states, as always for the Gibbs entropy of a system in thermal equilibrium. Here the
number of accessible states is $V/\Lambda^3$. The second term is $\frac{3}{2}k$ thanks to the equipartition
theorem, since the particle's energy depends quadratically on 3 degrees of
freedom. When $V \gg \Lambda^3$ this second term is a small correction to the first. As this
ceases to be true, the second term becomes more important, and when $\Lambda^3$ is comparable
to V, quantum corrections to our calculation also become significant.
A TALE OF TWO GASES


The entropy of an ideal gas of N distinguishable
classical particles of mass m in a box of volume V is

$$S_d = kN \left(\ln V + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{3}{2}\right)$$

while for indistinguishable particles it’s

$$S_i \approx kN \left(\ln \frac{V}{N} + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{5}{2}\right)$$

where the corrections are small compared to N as $N \to \infty$.


Now we are finally ready to tackle the entropy of a gas. We start with a ‘monatomic 
ideal gas’, which means N free point particles bouncing around in a box. But there's a
subtlety! We'll get different answers depending on whether we think of these particles 
as distinguishable or indistinguishable. That is: do we count the state of the gas as 
different if we switch two particles, or not? 

The formulas look very similar. There are three differences: 


• For distinguishable particles we'll get an exact formula, while for indistinguishable
particles we'll get an approximate one, where the corrections are small compared
to N when N becomes large.


• The entropy for distinguishable particles has a term equal to $\frac{3}{2}kN$, while for
indistinguishable particles it has a term equal to $\frac{5}{2}kN$.


• Most importantly, there's a huge difference in the volume dependence! Where the
distinguishable particles have a term in the entropy equal to $kN \ln V$, the
indistinguishable ones have a term equal to $kN \ln \frac{V}{N}$, so their entropy is considerably
smaller for large volumes.


The last difference makes the entropy behave strangely for distinguishable particles, 
so in practice the physically important case is the gas of indistinguishable particles. But 
we'll do the calculations in both cases, because the distinguishable case is easier, and 
interesting. 
GAS OF DISTINGUISHABLE PARTICLES: PARTITION
FUNCTION


The partition function of an ideal gas of N distinguishable
classical particles of mass m in a 3d box B of volume V is

$$Z_d = \frac{V^N}{\Lambda^{3N}}$$

where $\Lambda = h(\beta/2\pi m)^{1/2}$ is the thermal wavelength.


Suppose we have a system of N distinguishable classical free particles in a
three-dimensional box B of volume V. The state of this system is described by N positions
$q_1, \dots, q_N \in B$ and N momenta $p_1, \dots, p_N \in \mathbb{R}^3$. If each particle has mass m, the
energy of the ith particle is equal to

$$E_i = \frac{p_i \cdot p_i}{2m}$$

and the energy of the system is

$$E = \sum_{i=1}^N E_i.$$

Let's call the partition function of this system $Z_d$. To compute this we integrate
$\exp(-\beta E)$ over the space of states, obtaining

$$Z_d = \int_{B^N} \int_{\mathbb{R}^{3N}} \exp(-\beta E) \, \frac{d^3p_1 \cdots d^3p_N \, d^3q_1 \cdots d^3q_N}{h^{3N}}.$$

Above, I proceeded to compute $Z_d$ directly by doing the Gaussian integral over
momenta and integrating each position over the box. Here's a slightly different way.
Because

$$\exp(-\beta E) = \exp(-\beta E_1) \cdots \exp(-\beta E_N),$$

the partition function $Z_d$ is a product of integrals which are all equal:

$$Z_d = \left(\int_B \int_{\mathbb{R}^3} e^{-\beta p \cdot p/2m} \, \frac{d^3p \, d^3q}{h^3}\right)^N.$$
R 
The integral in the parentheses is the partition function of a single particle in a box.
We have already seen that this equals

$$\int_B \int_{\mathbb{R}^3} e^{-\beta p \cdot p/2m} \, \frac{d^3p \, d^3q}{h^3} = \frac{V}{\Lambda^3}$$

where $\Lambda$ is the thermal wavelength. Thus we have

$$Z_d = \left(\frac{V}{\Lambda^3}\right)^N = \frac{V^N}{\Lambda^{3N}}.$$

We can also do this calculation with a lot less work using Puzzle 36. This implies
that when we build a new system from N identical noninteracting copies of some old
system, the partition function of the new system is the Nth power of the partition
function of the old system. What I just did is show this in a special case.
GAS OF DISTINGUISHABLE PARTICLES: ENTROPY


In thermal equilibrium, an ideal gas of N distinguishable classical
particles of mass m in a 3-dimensional box of volume V has
expected energy

$$\langle E_d \rangle = \frac{3}{2} kNT$$

and free energy

$$F_d = -kT \ln \frac{V^N}{\Lambda^{3N}}$$

where $\Lambda = h/\sqrt{2\pi mkT}$ is the thermal wavelength.


Its entropy $S_d$ is $(\langle E_d \rangle - F_d)/T$, so

$$S_d = kN \left(\ln \frac{V}{\Lambda^3} + \frac{3}{2}\right)$$


We use the subscript d for a gas of N distinguishable particles. Since the energy is a
quadratic function of 3N variables, the equipartition theorem says the expected energy
is

$$\langle E_d \rangle = \frac{3}{2} kNT.$$

The free energy $F_d$ is minus kT times the logarithm of the partition
function, which we just computed:

$$F_d = -kT \ln \frac{V^N}{\Lambda^{3N}}.$$

Thus the entropy of the gas is

$$S_d = \frac{\langle E_d \rangle - F_d}{T} = kN \left(\ln \frac{V}{\Lambda^3} + \frac{3}{2}\right).$$

If we expand this out using

$$\Lambda = \frac{h}{\sqrt{2\pi mkT}}$$

we get the formula I promised earlier:

$$S_d = kN \left(\ln V + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{3}{2}\right).$$


The only advantage of this messier formula is that it separates out the temperature 
dependence and the volume dependence. 
THE GIBBS “PARADOX”


For the ideal gas of N distinguishable classical particles in a box
of volume V, the entropy

$$S_d = kN \left(\ln V + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{3}{2}\right)$$


more than doubles if we double both N and V 
while keeping everything else the same. 
This confused people for a while, 
so it’s called the ‘Gibbs paradox’. 


Start with a box B containing an ideal gas of distinguishable classical particles. 
Then double the volume of the box to get a new box B’, and double the number of 
particles in the box too, while keeping the temperature and everything else the same. 

We might expect the entropy to double. After all, we could take the doubled box 
and slip a thin wall down the middle to get two identical copies of the original box. So 
the entropy should be twice as big now. Right? 

Apparently not! Instead of just doubling the $kN \ln V$ term in the original entropy,
we are replacing it with $2kN \ln(2V)$, which is more than twice as big. The reason is
that in the doubled box B’ each individual particle has twice as much room to roam
as it would if you put a wall down the middle. Thus, it takes more information to say where
all the particles are.

While there's no real paradox here, people found this result deeply counterintuitive, 
so they called it the ‘Gibbs paradox’. And in fact they had a good reason for being
suspicious of this result. It would be correct if gas molecules were distinguishable. But 
in fact molecules of the same kind are not distinguishable—they don't have little labels 
on them that let you recognize which is which. And if we take this fact into account, 
our formula for the entropy changes. Let's see how! 




GAS OF INDISTINGUISHABLE PARTICLES: PARTITION 
FUNCTION 


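The partition function of an ideal gas of N indistinguishable
classical particles of mass m in a 3d box B of volume V is

$$Z_i(B) = \frac{Z_d(B)}{N!} = \frac{1}{N!} \frac{V^N}{h^{3N}} \left( \frac{2\pi m}{\beta} \right)^{3N/2}.$$

Thus

$$Z_i(B) = \frac{1}{N!} \frac{V^N}{\Lambda^{3N}}$$

where $\Lambda = h (\beta / 2\pi m)^{1/2}$ is the thermal wavelength.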


The partition function $Z_i$ for a gas of N indistinguishable particles is 1/N! times
that for a gas of distinguishable particles. Why? We got $Z_d$ by integrating $\exp(-\beta E)$
over the space of ordered N-tuples of position-momentum pairs. The energy E here
does not change if we permute our N-tuple, so we can also think of it as a function
of unordered N-tuples. Then we get $Z_i$ by integrating $\exp(-\beta E)$ over the space of
such unordered tuples. Notice that there are N! ordered N-tuples for each unordered
N-tuple, except for N-tuples with repeated entries, which form a set of measure zero
and thus contribute nothing to the integral. Thus, we should not be surprised that

$$Z_i(B) = \frac{Z_d(B)}{N!}.$$

But we've seen

$$Z_d(B) = \frac{V^N}{\Lambda^{3N}}$$

where $\Lambda$ is the thermal wavelength, so

$$Z_i(B) = \frac{1}{N!} \frac{V^N}{\Lambda^{3N}}.$$

Making this sketchy argument precise requires more notation. I think carefully doing
the case N = 2 is the best way for you to see what's going on.

GAS OF INDISTINGUISHABLE PARTICLES: ENTROPY


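In thermal equilibrium, an ideal gas of N indistinguishable
classical particles of mass m in a 3-dimensional box of volume V
has expected energy

$$\langle E_i \rangle = \frac{3}{2} kNT$$

and free energy

$$F_i = -kT \ln \left( \frac{1}{N!} \frac{V^N}{\Lambda^{3N}} \right)$$

where $\Lambda = h/\sqrt{2\pi mkT}$ is the thermal wavelength.
Its entropy $S_i$ is $(\langle E_i \rangle - F_i)/T$, so

$$S_i = kN \left( \ln \frac{V}{\Lambda^3} + \frac{3}{2} \right) - k \ln N!.$$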


We use the subscript i for a gas of N indistinguishable particles. Since the energy is a
quadratic function of the 3N momentum variables, the equipartition theorem says the
expected energy of this gas is

$$\langle E_i \rangle = \frac{3}{2} kNT.$$

The free energy $F_i$ is $-kT$ times the logarithm of the partition
function, which we just computed:

$$F_i = -kT \ln \left( \frac{1}{N!} \frac{V^N}{\Lambda^{3N}} \right).$$

Thus the entropy of the gas is

$$S_i = \frac{\langle E_i \rangle - F_i}{T} = kN \left( \ln \frac{V}{\Lambda^3} + \frac{3}{2} \right) - k \ln N!.$$

In short, it is $k \ln N!$ less than for the gas of distinguishable particles. This makes
beautiful intuitive sense: there are N! permutations of the particles that we no longer
care about in the indistinguishable case.

STIRLING'S FORMULA


Stirling's formula says

$$N! \sim \sqrt{2\pi N} \left( \frac{N}{e} \right)^N$$

and it gives

$$\ln N! \approx (\ln N - 1)N + \frac{1}{2} \ln 2\pi N$$

where the error goes to zero as $N \to +\infty$.


Now we need a bit of math: Stirling's formula for the factorial function. In one
form this says

$$\lim_{N \to +\infty} \frac{\sqrt{2\pi N} \left( \frac{N}{e} \right)^N}{N!} = 1.$$

We abbreviate this fact, that the ratio of two quantities approaches 1 as $N \to +\infty$, by
saying N! is asymptotic to $\sqrt{2\pi N} (N/e)^N$. We also write

$$N! \sim \sqrt{2\pi N} \left( \frac{N}{e} \right)^N$$

where the symbol ∼ means ‘asymptotic to’.
If we take the logarithm of both sides we get

$$\ln N! \approx (\ln N - 1)N + \frac{1}{2} \ln 2\pi N.$$

The symbol ≈ has a vaguer meaning: ‘approximately equal to’. But it turns out that in
this instance the approximation is extremely good: the difference between the left and
right sides goes to zero as $N \to +\infty$. In fact we will content ourselves with a cruder
approximation:

$$\ln N! \approx (\ln N - 1)N$$

because in the entropy of an ideal gas N is typically huge, so the term we have
discarded here is dwarfed by the others.
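
To get a feel for how good these two approximations are, here is a small sketch that compares them with $\ln N!$ for a few modest values of N. It is an illustration rather than part of the original text: the function names are made up here, and $\ln N!$ is computed with Python's `math.lgamma`, using the fact that the gamma function satisfies $\Gamma(N+1) = N!$.

```python
import math


def ln_factorial(n):
    """ln(n!) via the log-gamma function, since Gamma(n + 1) = n!."""
    return math.lgamma(n + 1)


def stirling_crude(n):
    """The cruder approximation (ln n - 1) n."""
    return (math.log(n) - 1) * n


def stirling_better(n):
    """The better approximation, with the extra (1/2) ln(2 pi n) term."""
    return stirling_crude(n) + 0.5 * math.log(2 * math.pi * n)


errors = {}
for n in (10, 1000, 10**6):
    exact = ln_factorial(n)
    # Store (error of crude form, error of better form) for each n.
    errors[n] = (exact - stirling_crude(n), exact - stirling_better(n))
    print(n, errors[n])
```

The error of the better form shrinks toward zero as N grows, while the error of the crude form grows slowly, like a logarithm, and so becomes negligible only relative to $\ln N!$ itself.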


Puzzle 47. Suppose N is Avogadro’s number, close to the number of atoms in 4 grams
of helium:

$$N \approx 6 \cdot 10^{23}.$$

What is the ratio of $\frac{1}{2} \ln 2\pi N$ to N?


While deriving Stirling's formula is fascinating and not at all trivial, doing so would 
take us rather far afield. So I will resist, and refer you instead to this: 


• John C. Baez, Stirling's formula, The n-Category Café, October 24, 2021.

THE SACKUR-TETRODE EQUATION


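In thermal equilibrium, an ideal gas of N indistinguishable
classical particles in a 3-dimensional box of volume V has entropy

$$S_i = kN \left( \ln \frac{V}{\Lambda^3} + \frac{3}{2} \right) - k \ln N!$$

where $\Lambda = h/\sqrt{2\pi mkT}$ is the thermal wavelength.
Using Stirling’s formula

$$\ln N! \approx (\ln N - 1)N$$

we get the Sackur-Tetrode equation:

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{5}{2} \right).$$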


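Taking our formula

$$S_i = kN \left( \ln \frac{V}{\Lambda^3} + \frac{3}{2} \right) - k \ln N!$$

and using a simple version of Stirling’s formula, $\ln N! \approx (\ln N - 1)N$, we get the famous
Sackur-Tetrode equation:

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{5}{2} \right).$$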


Note that with this formula, if we multiply both V and N by the same constant, the 
entropy also gets multiplied by that constant. In this situation we say the entropy is 
‘extensive’. 
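
Extensivity is easy to test numerically. The sketch below is an illustration under stated assumptions: the function name `sackur_tetrode` and the sample values of N, V, m and T are arbitrary choices made here, not numbers from the text. It scales N and V by several constants c and checks that the entropy scales by the same factor.

```python
import math

H_PLANCK = 6.62607015e-34   # Planck's constant in J s
K_BOLTZMANN = 1.380649e-23  # Boltzmann's constant in J/K


def sackur_tetrode(N, V, m, T):
    """S = kN (ln(V / (N Lambda^3)) + 5/2), in J/K."""
    lam = H_PLANCK / math.sqrt(2 * math.pi * m * K_BOLTZMANN * T)
    return K_BOLTZMANN * N * (math.log(V / (N * lam**3)) + 2.5)


# Arbitrary sample values (assumptions for illustration only).
N, V = 1.0e22, 5.0e-4     # particle count, volume in m^3
m, T = 6.6e-26, 300.0     # particle mass in kg, temperature in K

base = sackur_tetrode(N, V, m, T)
# For each c, the ratio S(cN, cV) / S(N, V) should equal c.
scaling_ratios = {
    c: sackur_tetrode(c * N, c * V, m, T) / base for c in (2.0, 3.7, 10.0)
}
print(scaling_ratios)
```

The reason this works is visible in the formula: V and N enter the logarithm only through the ratio V/N, so the factor c cancels there and survives only in the prefactor kN.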

For a better approximation, we can use

$$\ln N! \approx (\ln N - 1)N + \frac{1}{2} \ln 2\pi N$$

where the error goes to zero as $N \to \infty$. This gives a correction to the Sackur-Tetrode
equation:

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{5}{2} \right) - \frac{k}{2} \ln 2\pi N.$$

Here if we multiply both V and N by a constant c, we don’t just multiply the entropy
by c: the correction term changes from $-\frac{k}{2} \ln 2\pi N$ to $-\frac{k}{2} \ln 2\pi cN$, which is
not c times as big. So the entropy is not quite extensive, but this
effect is tiny when you've got a mole of gas.

THE ENTROPY OF AN IDEAL MONATOMIC GAS


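In thermal equilibrium, an ideal gas of N indistinguishable
classical particles in a 3-dimensional box of volume V has entropy
given approximately by the Sackur-Tetrode equation:

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{5}{2} \right).$$

But the thermal wavelength $\Lambda$ is

$$\Lambda = \frac{h}{\sqrt{2\pi mkT}}$$

so we can rewrite this as

$$S \approx kN \left( \ln \frac{V}{N} + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{5}{2} \right).$$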


We've done it: we've figured out the entropy of a gas of N indistinguishable classical
free particles in a 3-dimensional box of volume V. Above I've written it in two different
ways. Let's mull over the meaning of each term in each formula.

The first formula says

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{5}{2} \right).$$

Like the entropy of the classical harmonic oscillator and the classical free particle in a
box, this breaks up into two parts, thanks to the formula

$$S = \frac{\langle E \rangle - F}{T}.$$

But it does so a bit subtly. The two parts are not what you might naively think! They
are:


1. The free energy part:

$$-\frac{F}{T} \approx kN \left( \ln \frac{V}{N\Lambda^3} + 1 \right).$$

2. The expected energy part:

$$\frac{\langle E \rangle}{T} = \frac{3}{2} kN.$$

As usual, the free energy part of the entropy is k times the logarithm of the number
of accessible states. The expected energy part of the entropy is 3N/2 times k by the
equipartition theorem, since there are N particles each of whose energy depends on 3
momentum degrees of freedom.

The expected energy part of the entropy is small compared to the free energy part
when $V/N \gg \Lambda^3$: that is, when the volume available per particle greatly exceeds the
cube of its thermal wavelength. This happens for a gas that is sufficiently warm and
dilute, made of sufficiently massive particles. We will see that this is true for helium
at standard temperature and pressure. It's even more true for the heavier monatomic
gases: the noble gases like neon, argon, and krypton.

The surprise is the extra “+1” in the first part of the entropy, the free energy part.
It's telling us that the logarithm of the number of accessible states, divided by the
number of particles, is

$$\ln \frac{V}{N\Lambda^3} + 1.$$

What’s the physical origin of this mysterious extra nat?

Mathematically it comes from Stirling’s formula, which showed up when we switched
from a gas of distinguishable particles to a gas of indistinguishable particles. It may
seem odd that indistinguishability would increase the entropy by 1 nat per particle,
but don't be confused: as we've seen, it greatly reduces it. For a gas of distinguishable
particles the log of the number of accessible states, divided by the number of particles,
is $\ln(V/\Lambda^3)$. When we switch to indistinguishable particles this drops to $\ln(V/N\Lambda^3) + 1$.

Here is a rough heuristic explanation of what's going on. For a single particle in a
box of volume V, the number of accessible states is $V/\Lambda^3$. In a gas of distinguishable
free particles, each roams independently around the whole volume V. Thus, the log of
the number of accessible states is $\ln(V/\Lambda^3)$ per particle.

For a gas of indistinguishable particles, the story changes. For starters, we can
crudely pretend each particle is trapped in its own tiny box of volume V/N. After all,
if it leaves this tiny box by trading places with another particle in another tiny box,
nothing really changes. In this approximation, the log of the number of accessible states
is $\ln(V/N\Lambda^3)$ per particle.

But it's not really true that each particle can only leave its tiny box by trading places
with another. We can have more than one particle in the same tiny box, or none. That
is, our gas can have density fluctuations. An exact treatment of the problem gives, not
$\ln(V/N\Lambda^3)$ nats per particle, but

$$\ln \frac{V}{\Lambda^3} - \frac{1}{N} \ln N!.$$

Stirling's formula says this is approximately

$$\ln \frac{V}{\Lambda^3} - (\ln N - 1) = \ln \frac{V}{N\Lambda^3} + 1.$$

This explains the mysterious extra nat. The extra nat of entropy per particle is due to
density fluctuations!

As we've seen, even this is an oversimplification. A still better approximation, again
coming from Stirling's formula, says

$$\ln \frac{V}{\Lambda^3} - \frac{1}{N} \ln N! \approx \ln \frac{V}{N\Lambda^3} + 1 - \frac{1}{2} \frac{\ln(2\pi N)}{N}.$$

But as we saw in Puzzle 47, this further correction is negligible for a mole of gas. It
only becomes interesting for microscopic systems.

Now let's look at our second formula for the entropy of a gas of N indistinguishable
classical free particles:

$$S \approx kN \left( \ln \frac{V}{N} + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{5}{2} \right).$$

Not only is this harder to remember, it's generally less friendly to physical intuition.
First of all, three of the terms involve the logarithm of dimensionful quantities. Thus, 
when we change units they change, not by rescaling in the usual way, but by addition 
or subtraction. Secondly, the important role of the thermal wavelength is concealed in 
this formula. 

The main advantage of this formula is that it separates out three contributors to 
the entropy per particle: 


• The volume available per particle, V/N. The bigger this is, the more entropy the 
gas has per particle. 


• The temperature, T. The bigger this is, the more entropy per particle.
• The particle mass, m. The bigger this is, the more entropy per particle. 


The first two should be rather intuitive. But what about the third? We need to
combine V/N and T with the particle mass m and some constants of nature to get a
dimensionless quantity, which we can then take the logarithm of. This leads us straight
to the thermal wavelength:

$$\ln \frac{V}{N} + \frac{3}{2} \ln kT + \frac{3}{2} \ln \frac{2\pi m}{h^2} = \ln \frac{V (2\pi mkT)^{3/2}}{N h^3} = \ln \frac{V}{N\Lambda^3}.$$


Thus, my best explanation of why a gas of heavier particles has more entropy per 
particle is that they have a shorter thermal wavelength, so we can specify their position 
more accurately, and it takes more information to do so. 
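
The mass dependence can be made concrete with a small sketch. Everything named here is an assumption for illustration: the helper functions are made up, the volume per particle is that of a mole of ideal gas at standard temperature and pressure, and the 'heavy' particle is a hypothetical one with exactly four times the mass of the light one. Since the thermal wavelength goes as $1/\sqrt{m}$, quadrupling the mass should halve the wavelength and add $\frac{3}{2} \log_2 4 = 3$ bits of entropy per particle.

```python
import math

H_PLANCK = 6.62607015e-34   # Planck's constant in J s
K_BOLTZMANN = 1.380649e-23  # Boltzmann's constant in J/K


def thermal_wavelength(m, T):
    """Thermal wavelength h / sqrt(2 pi m k T), in meters."""
    return H_PLANCK / math.sqrt(2 * math.pi * m * K_BOLTZMANN * T)


def nats_per_particle(volume_per_particle, m, T):
    """Sackur-Tetrode entropy per particle in nats: ln(V / (N Lambda^3)) + 5/2."""
    lam = thermal_wavelength(m, T)
    return math.log(volume_per_particle / lam**3) + 2.5


T = 298.15                                    # temperature in K
volume_per_particle = 0.0247896 / 6.02214076e23  # m^3 per particle
m_light = 3.34706e-27                         # sample mass in kg
m_heavy = 4 * m_light                         # hypothetical particle, 4x the mass

wavelength_ratio = thermal_wavelength(m_heavy, T) / thermal_wavelength(m_light, T)
extra_nats = (nats_per_particle(volume_per_particle, m_heavy, T)
              - nats_per_particle(volume_per_particle, m_light, T))
extra_bits = extra_nats / math.log(2)
print(wavelength_ratio, extra_bits)
```

The volume per particle and the temperature drop out of the difference, so the extra entropy depends only on the mass ratio.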
WHERE ARE WE NOW?


The mystery: why does each molecule of hydrogen have ~ 23 bits
of entropy at standard temperature and pressure?


The goal: derive and understand the formula for the entropy of a
classical ideal monatomic gas:

$$S = kN \left( \ln \frac{V}{N} + \frac{3}{2} \ln kT + \gamma \right)$$

including the mysterious constant $\gamma$:

$$\gamma = \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{5}{2}.$$

The subgoal: compute the entropy of a single classical
particle in a 1-dimensional box:

$$S = k \left( \ln L + \frac{1}{2} \ln kT + \frac{1}{2} \ln \frac{2\pi m}{h^2} + \frac{1}{2} \right).$$

The sub-subgoal: explain entropy from the ground up, and
compute the entropy of a classical harmonic oscillator:

$$S = k \left( \ln \frac{kT}{\hbar\omega} + 1 \right).$$


Okay, now we know the entropy of a classical ideal monatomic gas! We even know 
what it means. Unfortunately we’re trying to figure out the entropy of hydrogen, which 
is diatomic. But we can do helium, which is monatomic... and then we'll do hydrogen. 




ENTROPY PER MOLE VERSUS BITS PER MOLECULE 
A nat of unknown information is $1.380649 \cdot 10^{-23}$ joules/kelvin of
entropy: this is Boltzmann's constant.


There are $6.02214076 \cdot 10^{23}$ molecules per mole:
this is Avogadro's number.


Thus, one nat of unknown information per molecule corresponds to

$$1.380649 \cdot 10^{-23} \times 6.02214076 \cdot 10^{23} \approx 8.314463$$

joule/kelvin of entropy per mole.


A bit is $\ln 2 \approx 0.69315$ nats, so one bit of unknown information
per molecule corresponds to about

$$0.69315 \times 8.314463 \approx 5.763146$$

joule/kelvin of entropy per mole.


Here is a little fact we need now: one bit of Shannon entropy per molecule equals 
about 5.76 joules/kelvin of Gibbs entropy per mole. I apologize for the oppressively 
large number of decimal places above, but I want to compare our theoretical predictions 
of the entropy of helium and hydrogen to experimental results, and it's not clear yet 
how closely our answers will match experiment, so it's good to be prepared. 

By the way, the values of Boltzmann's constant and Avogadro's number here are 
exact, fixed by the definition of SI units. So there is no experimental uncertainty in any 
of the numbers on this page. 
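
For readers who want to reproduce these conversion factors, here is a short sketch. The variable names are chosen here for illustration; the two constants are the exact SI values quoted above, and the natural logarithm of 2 is taken at full floating-point precision rather than rounded to 0.69315.

```python
import math

K_BOLTZMANN = 1.380649e-23   # J/K, exact by definition of the SI
N_AVOGADRO = 6.02214076e23   # molecules per mole, exact by definition of the SI

# One nat per molecule, expressed in joules/kelvin per mole.
jk_per_mole_per_nat = K_BOLTZMANN * N_AVOGADRO

# One bit is ln 2 nats, so scale by ln 2 to get the factor per bit.
jk_per_mole_per_bit = math.log(2) * jk_per_mole_per_nat

print(jk_per_mole_per_nat, jk_per_mole_per_bit)
```

Dividing a measured molar entropy in joules/kelvin per mole by the second number converts it to bits per molecule.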




THE ENTROPY OF HELIUM: THEORY 


The Sackur-Tetrode equation says that assuming helium is a
classical ideal monatomic gas, its entropy is

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{5}{2} \right)$$

which corresponds to

$$\ln \frac{V}{N\Lambda^3} + \frac{5}{2}$$

nats of unknown information per atom. At standard temperature
and pressure, this gives about 15.173 nats or

$$\frac{15.173}{\ln 2} \approx 21.890$$

bits of unknown information per atom.


Now let’s calculate the entropy of helium in its gaseous state. NIST has tabulated
its entropy at standard temperature and pressure, specifically temperature T = 298.15
K and pressure P = 1 bar, so that’s what we'll try to calculate. An atom of helium has
a mass of $m = 6.646477 \cdot 10^{-27}$ kg, so at standard temperature its thermal wavelength
is

$$\Lambda = \frac{h}{\sqrt{2\pi mkT}} \approx \frac{6.62607 \cdot 10^{-34} \text{ J s}}{\sqrt{2\pi \times 6.646477 \cdot 10^{-27} \text{ kg} \times 1.380649 \cdot 10^{-23} \text{ J/K} \times 298.15 \text{ K}}} \approx 5.053721 \cdot 10^{-11} \text{ m}.$$


For a mole of an ideal gas we have $N = 6.02214076 \cdot 10^{23}$ (this is Avogadro's number),
and at standard temperature and pressure a mole of ideal gas has $V \approx 0.0247896$ m³:
this is called its ‘molar volume’. The molar volume of helium is actually slightly different
from this, because helium is not an ideal gas: the atoms interact. But since we're doing
a calculation assuming helium is a classical ideal gas, let's ignore that for now. Using
the thermal wavelength $\Lambda \approx 5.053721 \cdot 10^{-11}$ m, we then get

$$\frac{V}{N\Lambda^3} \approx \frac{0.0247896 \text{ m}^3}{6.02214076 \cdot 10^{23} \times (5.053721 \cdot 10^{-11} \text{ m})^3} \approx 318922.$$

We thus have

$$\ln \frac{V}{N\Lambda^3} \approx \ln 318922 \approx 12.673.$$

As explained earlier, this means that the logarithm of the number of accessible states
of each helium atom would be 12.673 if it were trapped in its own small box of volume
V/N. But density fluctuations contribute 1 extra nat of entropy per atom. Thus, the
free energy part of the entropy per atom is 13.673 nats. On the other hand, the expected
energy part of the entropy per atom is 3/2, coming from the atom's 3 momentum degrees
of freedom. The total entropy per atom is thus

$$\ln \frac{V}{N\Lambda^3} + 1 + \frac{3}{2} \approx 15.173$$

nats.
To impress our friends we can convert this to bits: we divide by ln 2 and get about

$$\frac{15.173}{0.69315} \approx 21.890$$


bits of unknown information per atom of helium. 

I've kept only 5 significant figures in the later stages of these calculations, since 
that's how precise the experimental data is. Next let's compare the final result to 
experiment! 
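
Since this calculation chains several steps of arithmetic, it is worth recomputing from the quoted inputs. The following is a sketch of that cross-check, not the author's own code: the variable names are assumptions, while the constants, the helium mass, the temperature and the molar volume are the values stated in the text. It evaluates the thermal wavelength, then $V/N\Lambda^3$, then the entropy per atom in nats and in bits.

```python
import math

H_PLANCK = 6.62607015e-34    # Planck's constant in J s
K_BOLTZMANN = 1.380649e-23   # Boltzmann's constant in J/K
N_AVOGADRO = 6.02214076e23   # atoms per mole

T = 298.15                   # standard temperature in K
MOLAR_VOLUME = 0.0247896     # m^3 per mole of ideal gas at 298.15 K and 1 bar
M_HELIUM = 6.646477e-27      # mass of a helium atom in kg

# Step 1: thermal wavelength, Lambda = h / sqrt(2 pi m k T).
lam = H_PLANCK / math.sqrt(2 * math.pi * M_HELIUM * K_BOLTZMANN * T)

# Step 2: volume per atom measured in units of Lambda^3.
states_per_atom = MOLAR_VOLUME / (N_AVOGADRO * lam**3)

# Step 3: Sackur-Tetrode entropy per atom, ln(V / (N Lambda^3)) + 5/2 nats.
nats_per_atom = math.log(states_per_atom) + 2.5
bits_per_atom = nats_per_atom / math.log(2)

print(lam, states_per_atom, nats_per_atom, bits_per_atom)
```

Because the wavelength enters cubed, an error of a few percent in $\Lambda$ shifts the final answer by a noticeable fraction of a bit, so carrying the same value of $\Lambda$ through every step matters.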
THE ENTROPY OF HELIUM: EXPERIMENT


The entropy of helium at standard temperature and pressure has
been measured to be
126.15 joules/kelvin per mole.


One bit of unknown information per atom corresponds to about


5.7631 joule/kelvin of entropy per mole.


Thus, each atom of helium at standard temperature and pressure
carries about

$$\frac{126.15}{5.7631} \approx 21.889$$

bits of unknown information.


Experimentally, the entropy of helium at standard temperature and pressure is
126.15 joules/kelvin per mole. Converting this to bits per atom we get 21.889, which
agrees with our theoretical result of 21.890 to within about 0.001 bits.

There are still a couple of reasons to expect small deviations from our theoretical
value. First, while our
theoretical calculation assumed that helium is an ideal gas of noninteracting point
particles, this is not true. The helium atoms interact!

Second, our computation ignored quantum effects, except for using Planck's constant
to determine the thermal wavelength. Even for an ideal gas, quantum effects
become important when $V/N\Lambda^3$ ceases to be large. This happens at high densities
(small V/N), low temperatures T, or for particles of small mass m. Helium has a low mass as
molecules of gas go, and our ultimate goal, hydrogen, is even worse.

Now let's tackle the final summit: hydrogen. This is a diatomic gas, so it works 
differently.

THE IDEAL DIATOMIC GAS


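In thermal equilibrium, a classical ideal diatomic gas of N
indistinguishable molecules of mass m in a 3-dimensional box of
volume V has expected energy

$$\langle E \rangle = \frac{5}{2} kNT$$

and free energy

$$F = -kT \ln \left( \frac{1}{N!} \frac{V^N}{\Lambda^{3N}} \right)$$

where $\Lambda = h/\sqrt{2\pi mkT}$ is the thermal wavelength.
Its entropy S is $(\langle E \rangle - F)/T$, so

$$S = kN \left( \ln \frac{V}{\Lambda^3} + \frac{5}{2} \right) - k \ln N!$$

and using Stirling's formula $\ln N! \approx (\ln N - 1)N$ we get

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{7}{2} \right).$$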


It's easy to repeat our computation of entropy for a diatomic gas if we recall that
the tumbling of the molecules adds two degrees of freedom to the three momentum
degrees of freedom, giving $\langle E \rangle = \frac{5}{2} kNT$. Tracking the effects of this change we see the entropy is higher
than for a monatomic gas. To be precise, the entropy of a classical ideal diatomic gas
is

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{7}{2} \right).$$

So, it has one more nat of Shannon entropy per molecule than an ideal monatomic gas!
Let's see how this plays out for hydrogen.

THE ENTROPY OF HYDROGEN: THEORY


Assuming hydrogen is a classical ideal diatomic gas, its entropy is

$$S \approx kN \left( \ln \frac{V}{N\Lambda^3} + \frac{7}{2} \right)$$

which corresponds to

$$\ln \frac{V}{N\Lambda^3} + \frac{7}{2}$$

nats of unknown information per molecule.
At standard temperature and pressure, this gives 15.144 nats or

$$\frac{15.144}{\ln 2} \approx 21.848$$

bits of unknown information per molecule.


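A hydrogen molecule has $m = 3.34706 \cdot 10^{-27}$ kg, so at a temperature T = 298.15
K its thermal wavelength is

$$\Lambda = \frac{h}{\sqrt{2\pi mkT}} \approx \frac{6.62607 \cdot 10^{-34} \text{ J s}}{\sqrt{2\pi \times 3.34706 \cdot 10^{-27} \text{ kg} \times 1.380649 \cdot 10^{-23} \text{ J/K} \times 298.15 \text{ K}}} \approx 7.12156 \cdot 10^{-11} \text{ m}.$$

For a mole of an ideal gas at standard temperature and pressure, $N = 6.02214076 \cdot 10^{23}$
and $V \approx 0.0247896$ m³, so

$$\frac{V}{N\Lambda^3} \approx \frac{0.0247896 \text{ m}^3}{6.02214076 \cdot 10^{23} \times (7.12156 \cdot 10^{-11} \text{ m})^3} \approx 113971.$$

We thus have

$$\ln \frac{V}{N\Lambda^3} \approx \ln 113971 \approx 11.644.$$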

Thanks to our previous work we know this means that the logarithm of the
number of accessible states of each molecule would be 11.644 if it were trapped in its
own small box of volume V/N. There is also a correction to this simplified picture due
to density fluctuations, which gives 1 extra nat of entropy. These add up to give the
free energy contribution to the entropy per molecule: 12.644 nats. This is less than
we got for helium. But the expected energy contribution to the entropy per molecule
is larger: we again get 3/2 nats from the molecule's 3 momentum degrees of freedom,
but now we get 1 extra nat due to its 2 extra tumbling degrees of freedom. The total
number of nats of unknown information per hydrogen molecule is thus

$$\ln \frac{V}{N\Lambda^3} + 1 + \frac{3}{2} + 1 \approx 15.144.$$

Finally, the number of bits of unknown information per hydrogen molecule is

$$\frac{15.144}{0.69315} \approx 21.848.$$


This is slightly less than for helium, where the number was 21.890.
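
The hydrogen calculation can be recomputed in a few lines. This is a sketch under stated assumptions: the function `bits_per_molecule` and its parameter names are made up for illustration, and it simply encodes the formula used above, with 1 nat from density fluctuations plus half a nat for each degree of freedom on which the energy depends quadratically (3 for a monatomic gas, 5 for a diatomic one). The mass, temperature and molar volume are the values quoted in the text.

```python
import math

H_PLANCK = 6.62607015e-34    # Planck's constant in J s
K_BOLTZMANN = 1.380649e-23   # Boltzmann's constant in J/K
N_AVOGADRO = 6.02214076e23   # molecules per mole


def bits_per_molecule(m, quadratic_dof, T=298.15, molar_volume=0.0247896):
    """Entropy per molecule in bits for a classical ideal gas.

    nats = ln(V / (N Lambda^3)) + 1 + quadratic_dof / 2
    """
    lam = H_PLANCK / math.sqrt(2 * math.pi * m * K_BOLTZMANN * T)
    nats = math.log(molar_volume / (N_AVOGADRO * lam**3)) + 1 + quadratic_dof / 2
    return nats / math.log(2)


# Hydrogen molecule: 3 momentum plus 2 tumbling degrees of freedom.
hydrogen_bits = bits_per_molecule(3.34706e-27, quadratic_dof=5)
print(hydrogen_bits)
```

Calling the same function with a different mass and `quadratic_dof=3` gives the monatomic case, which makes it easy to compare gases on an equal footing.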

As a sanity check, let's do this calculation a different way. A hydrogen molecule
is close to half the mass of a helium atom, so its thermal wavelength should be $\sqrt{2}$
times as large. In our calculation we're treating V/N as the same for both gases, so
hydrogen's $V/N\Lambda^3$ should be $2^{-3/2}$ times as large as that for helium. Since ultimately
we compute bits by taking a logarithm in base 2, this reduces its entropy per molecule
by 3/2 bits. However, hydrogen's 2 tumbling degrees of freedom increase its entropy
per molecule by 1 nat, or 1/ln 2 bits. We have

$$-\frac{3}{2} + \frac{1}{\ln 2} \approx -1.5 + 1.443 \approx -0.057.$$

This suggests that each hydrogen molecule should carry 0.057 fewer bits of unknown
information than each helium atom. Why did our more careful calculation say hydrogen
should have only about

$$21.890 - 21.848 \approx 0.042$$

fewer bits of unknown information per molecule? Where does the difference come from?

The slight discrepancy arises solely from the fact that a hydrogen molecule is not
exactly half the mass of a helium atom! It's a bit heavier. It's actually more like 0.50358
times the mass of a helium atom. This makes its thermal wavelength a bit smaller than our
estimate in the last paragraph, which boosts its entropy. It's nice that such subtleties,
ultimately due to nuclear physics, are showing up here.

By the way, all our calculations have been for the most common isotopes of hydrogen 
and helium: hydrogen whose nucleus consists of a single proton, and helium whose 
nucleus consists of two protons and two neutrons. Other isotopes have significantly 
different mass, and this changes the entropy values significantly. 


Puzzle 48. Helium has a lighter isotope called helium-3, whose nucleus is made of two
protons and just one neutron. The mass of helium-3 is $5.00823 \times 10^{-27}$ kg. If we repeat
our calculation of the entropy of helium at standard temperature and pressure, changing
only this mass, what value do we get for the bits of entropy per atom of helium-3?


Puzzle 49. Hydrogen has a heavier isotope called deuterium, whose nucleus is made of
one proton and one neutron. The mass of a deuterium atom is $3.34449 \times 10^{-27}$ kg, so
a hydrogen molecule made of two deuterium atoms has twice this mass, about
$6.689 \times 10^{-27}$ kg. If we repeat our calculation of the entropy of hydrogen at
standard temperature and pressure, changing only this mass, what do we get for the
bits of entropy per molecule of this sort?

THE ENTROPY OF HYDROGEN: EXPERIMENT


The entropy of hydrogen at standard temperature and pressure 
has been measured to be 
130.68 joules/kelvin per mole. 


One bit of unknown information per molecule corresponds to 
about 5.7631 joule/kelvin of entropy per mole. 


Thus, each molecule of hydrogen at standard temperature and
pressure has about

$$\frac{130.68}{5.7631} \approx 22.675$$


bits of unknown information. 


Okay, let’s compare our theoretical prediction to experiment. 

The experimental figure for the entropy of hydrogen at standard temperature and 
pressure is 130.68 joules/kelvin per mole, which translates into 22.675 bits per molecule. 
This is larger than our theoretical prediction of 21.848 bits per molecule by about 3.8%. 
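
The unit conversion and the percentage gap take only a few lines to verify. This sketch uses variable names chosen here for illustration; the measured molar entropy and the theoretical prediction of 21.848 bits are the figures quoted in the text.

```python
import math

K_BOLTZMANN = 1.380649e-23   # J/K
N_AVOGADRO = 6.02214076e23   # molecules per mole

# Joules/kelvin per mole corresponding to one bit per molecule.
JK_PER_MOLE_PER_BIT = K_BOLTZMANN * N_AVOGADRO * math.log(2)

measured_molar_entropy = 130.68   # J/(K mol), hydrogen, as quoted in the text
predicted_bits = 21.848           # theoretical value from the text

measured_bits = measured_molar_entropy / JK_PER_MOLE_PER_BIT
percent_gap = 100 * (measured_bits - predicted_bits) / predicted_bits
print(measured_bits, percent_gap)
```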

That’s not bad. We can say we solved our original problem fairly well. But the
error here is far worse than it was for our calculation for helium, where theory and
experiment agreed to within about 0.001 bits.
Why is it worse?

I haven’t studied this, but I can imagine two reasons. First, remember that quantum
effects kick in when $V/N\Lambda^3$ ceases to be large. This quantity is a bit smaller for
hydrogen than for helium. Remember, for helium it was 318922 at standard temperature
and pressure, while for hydrogen it’s 113971. But that’s still very large, so I imagine
quantum effects are still quite tiny.

Second, hydrogen molecules are not chemically inert like helium atoms, and they’re 
larger, and diatomic rather than monatomic. So I’d expect them to interact more, 
making the ideal gas approximation worse. This feels like a more plausible explanation 
for the 3.8% discrepancy. 


Puzzle 50. Do research to find more accurate calculations of the entropy of hydrogen 
gas. What are the main sources of error in the calculation we have done here?

WHERE DID WE GO?


The mystery: why does each molecule of hydrogen have ~ 23 bits
of entropy at standard temperature and pressure?


The goal: derive and understand the formula for the entropy of a
classical ideal monatomic gas:

$$S = kN \left( \ln \frac{V}{N} + \frac{3}{2} \ln kT + \gamma \right)$$

including the mysterious constant $\gamma$:

$$\gamma = \frac{3}{2} \ln \frac{2\pi m}{h^2} + \frac{5}{2}.$$

The subgoal: compute the entropy of a single classical
particle in a 1-dimensional box:

$$S = k \left( \ln L + \frac{1}{2} \ln kT + \frac{1}{2} \ln \frac{2\pi m}{h^2} + \frac{1}{2} \right).$$

The sub-subgoal: explain entropy from the ground up, and
compute the entropy of a classical harmonic oscillator:

$$S = k \left( \ln \frac{kT}{\hbar\omega} + 1 \right).$$


We're done! Or at least we reached our stated goal. But there is a lot more to 
say about entropy. In a way we've scarcely scratched the surface. For more on the 
mathematics of entropy, I recommend these books: 


• Thomas A. Cover and Joy A. Thomas, Elements of Information Theory, Wiley- 
Interscience, New York, 2006. 


• Tom Leinster, Entropy and Diversity: the Axiomatic Approach, Cambridge U.
Press, Cambridge, 2021. Also free on the arXiv. 


For classical and quantum statistical mechanics, I recommend these: 


• Frederick Reif, Fundamentals of Statistical and Thermal Physics, Waveland Press, 
Long Grove, Illinois, 2009. 


• Dirk Ter Haar, Elements of Statistical Mechanics, Elsevier, Amsterdam, 1995. 


The second one has an intense focus on our friend the box of gas. And for the principle 
of maximum entropy, I again recommend this insightful and opinionated text: 


• E. T. Jaynes, Probability Theory: the Logic of Science, Cambridge U. Press,
Cambridge, 2003.

THE FIRST LAW OF THERMODYNAMICS


Suppose a system has some measure space X of states
with functions called energy $E \colon X \to \mathbb{R}$
and volume $V \colon X \to \mathbb{R}$.


Consider probability distributions on X
maximizing the Gibbs entropy S subject to constraints
on $\langle E \rangle$ and $\langle V \rangle$.


Then as we vary $\langle E \rangle$ and $\langle V \rangle$ we have

$$d\langle E \rangle = T dS - P d\langle V \rangle$$

where T is called temperature and P is called pressure.


I said we were done. But what kind of course on entropy doesn't cover the three laws 
of thermodynamics? I talked a bit about the Third Law, but I haven't even mentioned 
the other two yet. 

Here's why: this wasn't a course on thermodynamics. In 'classical thermodynamics' 
there's a tradition of taking concepts such as energy, work and heat as primitive, and 
treating the laws of thermodynamics as axioms. I've instead been explaining a bit of
'classical statistical mechanics', where we start with probability theory and attempt 
to derive classical thermodynamics. In this approach the laws of thermodynamics are 
not fundamental. They actually look a bit odd: they become results that hold under 
various conditions, so each one becomes a collection of theorems and conjectures. 


I'll state versions of the three laws of thermodynamics in the language we've developed
here. But please be aware that my versions are idiosyncratic and will make some
people raise their eyebrows. I'm afraid you'll have to go elsewhere, like Reif's book, to
learn these laws in their traditional form!

We've been maximizing entropy subject to a constraint on the expected value of
one quantity. What if we do two, or more? Everything works the same way, but the
fundamental relation between temperature, energy and entropy, $d\langle E \rangle = T dS$, gets one
extra term for each constraint. The resulting equation is a version of the ‘First Law of
Thermodynamics’.

I'll explain the case with one extra constraint. Suppose we’ve got a measure space
X whose points are states of some system. Choose two functions on it. They could be
anything, but let's call them energy and volume and write them as $E \colon X \to \mathbb{R}$ and
$V \colon X \to \mathbb{R}$. These terms are favored because thermodynamics arose in part from the
study of steam engines, where you've got a cylinder of steam with some energy and
some volume. For any probability distribution $p \colon X \to [0, \infty)$, we can write down a
formula for its Shannon entropy

$$H = -\int_X p(x) \ln p(x) \, dx$$

and also the expected values

$$\langle E \rangle = \int_X p(x) E(x) \, dx, \qquad \langle V \rangle = \int_X p(x) V(x) \, dx.$$


Let's not worry now about whether these integrals converge. 

Suppose we only know $\langle E \rangle$ and $\langle V \rangle$, and we are trying to choose the ‘best’ probability
distribution p with these expected values. What should we do? Following the
principle of maximum entropy, we seek the probability distribution p that maximizes
H subject to our constraints on $\langle E \rangle$ and $\langle V \rangle$. If we do this, we are led to a Lagrange
multipliers problem, much as in the simpler case of one constraint. But now we
need two Lagrange multipliers: let's call them $\beta$ and $\gamma$. We get this equation:

$$dH = \beta \, d\langle E \rangle + \gamma \, d\langle V \rangle.$$

This is the First Law!

But this isn't the way physicists usually write it. To get the First Law in its usual
form, first let's switch to using Gibbs entropy S = kH, and emphasize the role of energy
by solving for $d\langle E \rangle$:

$$d\langle E \rangle = \frac{1}{k\beta} dS - \frac{\gamma}{\beta} d\langle V \rangle.$$

Then, to simplify the look of this equation, let’s introduce variables called temperature
and pressure:

$$T = \frac{1}{k\beta}, \qquad P = \frac{\gamma}{\beta}.$$

Now the First Law of Thermodynamics looks like this:

$$d\langle E \rangle = T dS - P d\langle V \rangle.$$


It says that as we move around among probability distributions that maximize entropy
subject to constraints on expected energy and volume, the change in expected energy
is the sum of two terms:


• heat, meaning $T dS$
• work, meaning $-P d\langle V \rangle$.


For example, if we have a cylinder of steam with pressure P and we increase its expected
volume by a little bit $\Delta \langle V \rangle$ while keeping its entropy fixed, its expected energy goes
down by about $P \Delta \langle V \rangle$: that's
how we understand the minus sign. In this situation the external world has done an
amount of work $-P \Delta \langle V \rangle$ on the cylinder of steam, but most people say the cylinder
of steam has done an amount of work $P \Delta \langle V \rangle$ on the external world.
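
The First Law in the form $dH = \beta \, d\langle E \rangle + \gamma \, d\langle V \rangle$ can be tested numerically on a toy system. The sketch below rests on several assumptions made only for illustration: a system with four states whose energies and volumes are made-up numbers, units where k = 1, and probability distributions of the form $p_i \propto \exp(-\beta E_i - \gamma V_i)$, which is the form the puzzles below ask you to derive from entropy maximization. It nudges $\beta$ and $\gamma$ slightly and compares the two sides.

```python
import math

# Made-up energies and volumes for a toy system with four states.
E = [0.0, 1.0, 2.0, 3.5]
V = [1.0, 0.5, 2.0, 1.5]


def gibbs_state(beta, gamma):
    """Distribution p_i proportional to exp(-beta E_i - gamma V_i)."""
    weights = [math.exp(-beta * e - gamma * v) for e, v in zip(E, V)]
    total = sum(weights)
    return [w / total for w in weights]


def summarize(p):
    """Return (Shannon entropy H, expected energy, expected volume)."""
    entropy = -sum(pi * math.log(pi) for pi in p)
    mean_energy = sum(pi * e for pi, e in zip(p, E))
    mean_volume = sum(pi * v for pi, v in zip(p, V))
    return entropy, mean_energy, mean_volume


beta, gamma = 0.7, 0.3
d_beta, d_gamma = 1e-6, -2e-6   # small nudges to the Lagrange multipliers

H0, E0, V0 = summarize(gibbs_state(beta, gamma))
H1, E1, V1 = summarize(gibbs_state(beta + d_beta, gamma + d_gamma))

dH = H1 - H0
first_law_rhs = beta * (E1 - E0) + gamma * (V1 - V0)
print(dH, first_law_rhs)
```

The two numbers agree to first order in the nudges; the small remaining difference is second order and shrinks if `d_beta` and `d_gamma` are made smaller.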

Here are a few puzzles if you want to dig deeper. In the first two, I ask you
to generalize ideas from our earlier work on maximizing entropy subject to a single
constraint.


Puzzle 51. Let X = {1,...,n} and let $E, V \colon X \to \mathbb{R}$ be two functions whose values
at $i \in X$ we call $E_i$ and $V_i$. Suppose p is a probability distribution maximizing the
Shannon entropy H on the surface where

$$\langle E \rangle = e, \qquad \langle V \rangle = v,$$

and also suppose $p_1, \dots, p_n > 0$. Show that at p we have

$$dH = \beta \, d\langle E \rangle + \gamma \, d\langle V \rangle$$

for some $\beta, \gamma \in \mathbb{R}$. (Hint: first do the case where not all the $E_i$ are equal and not all
the $V_i$ are equal. This guarantees that $d\langle E \rangle$ and $d\langle V \rangle$ are nonzero. You can handle the
other cases separately.)


Puzzle 52. Under the conditions of Puzzle 51 show that

$$p_i = \frac{\exp(-\beta E_i - \gamma V_i)}{\sum_{j=1}^n \exp(-\beta E_j - \gamma V_j)}.$$

Puzzle 53. Generalize the results of Puzzles 51 and 52 to the case of any finite number 
of constraints. 


Puzzle 54. Generalize the results of Puzzle 53 to the case of a system with a countable
infinity of states, or an arbitrary measure space of states. You will need to add
assumptions to ensure that the sums or integrals converge.




THE SECOND LAW OF THERMODYNAMICS 


Suppose a system has some measure space X of states 
and at any time t there is a probability distribution p(t) on X. 


We say the second law of thermodynamics holds if

$$t_1 \le t_2 \implies S(p(t_1)) \le S(p(t_2)).$$


This seems to be widely true, yet the conditions under which 
it holds are subtle and much-argued. 


The Second Law of Thermodynamics, as commonly stated, says that the entropy 
of a closed system never decreases. This appears to be a profound fact about our 
universe. A huge challenge to physics is to understand where this law comes from. Can
it be derived from some realistic assumptions? One problem is that the laws of classical 
mechanics are invariant under time-reversal. Thus, if we evolve probability distributions 
on some space of states according to these laws, for any probability distribution whose 
entropy is nondecreasing, there is a time-reversed one whose entropy is nonincreasing. 

This is called the problem of the arrow of time: briefly, why does the future look so 
different from the past? Quantum mechanics makes the problem subtler, but does not 
provide an easy resolution. The solution may be that we happen to live in a universe—a 
particular solution of the laws of physics—where entropy was very low at the Big Bang, 
making it easy for entropy to increase after that. But if you get ten physicists in a 
room and ask them to explain the arrow of time, you are likely to hear ten different 
opinions. Thus, I will not attempt to resolve it here. For more on that, I recommend 
this book: 


• H. D. Zeh, The Physical Basis of The Direction of Time, Springer, Berlin, 2010. 


Instead, let's see how the Second Law sheds light on the meaning of temperature. 
You'll notice that in our course I never talked about systems evolving in time, and I 
never talked about two systems interacting: always just a single system. Now let's 
imagine two systems, each in thermal equilibrium, but at possibly different temperatures.
Say the first has entropy $S_1$, expected energy $\langle E_1 \rangle$ and temperature $T_1$. As usual,
these are related by

$$ dS_1 = \frac{d\langle E_1 \rangle}{T_1}. $$

Say the second system works similarly, with

$$ dS_2 = \frac{d\langle E_2 \rangle}{T_2}. $$


We can define the total entropy of the two systems by 
$$ S = S_1 + S_2 $$

and the total expected energy by

$$ \langle E \rangle = \langle E_1 \rangle + \langle E_2 \rangle. $$


Suppose now that the two systems can exchange energy with each other, but in a slow 
and gentle way, so we can approximately treat each one as in thermal equilibrium at 
any moment. If no energy flows in or out of the combined system, the total expected 
energy is conserved, so 


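$$ \frac{d\langle E \rangle}{dt} = 0 $$

and thus

$$ \frac{d\langle E_1 \rangle}{dt} = -\frac{d\langle E_2 \rangle}{dt}. $$

What does the Second Law give us in this situation? It implies

$$ \frac{dS}{dt} \ge 0 $$

or

$$ \frac{dS_1}{dt} + \frac{dS_2}{dt} \ge 0. $$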


It follows that 
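$$ \frac{1}{T_1} \frac{d\langle E_1 \rangle}{dt} + \frac{1}{T_2} \frac{d\langle E_2 \rangle}{dt} \ge 0 $$

or

$$ \frac{1}{T_1} \frac{d\langle E_1 \rangle}{dt} - \frac{1}{T_2} \frac{d\langle E_1 \rangle}{dt} \ge 0. $$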


We can rewrite this as 


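$$ \left( \frac{1}{T_1} - \frac{1}{T_2} \right) \frac{d\langle E_1 \rangle}{dt} \ge 0. $$

Now suppose both $T_1$ and $T_2$ are positive. Then we get a remarkable consequence:
as two systems exchange energy, with each staying in thermal equilibrium at every
moment, expected energy can only flow from the system with higher temperature to
the system with lower temperature! To see this, suppose $T_1 > T_2 > 0$. Then
$1/T_1 - 1/T_2 < 0$, so the inequality forces $d\langle E_1 \rangle / dt \le 0$: the hotter system can
only lose expected energy, and by conservation the cooler one gains it. If instead
$T_2 > T_1 > 0$, the roles are reversed.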
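
Here is a toy simulation of this argument, offered as a sketch rather than anything derived in the text. All the modeling choices are assumptions made for illustration: each system has a constant heat capacity `C1` or `C2` (so $\langle E_i \rangle = C_i T_i$ and $S_i = C_i \ln T_i$ up to a constant), Boltzmann's constant is set to 1, energy flows at a rate proportional to the temperature difference with a made-up coefficient `kappa`, and time is stepped with the simplest Euler scheme.

```python
import math

def simulate(T1, T2, C1=1.0, C2=2.0, kappa=0.1, dt=0.01, steps=2000):
    """Two systems exchanging energy slowly; returns (T1, T2, S, flow) per step."""
    history = []
    for _ in range(steps):
        # Total entropy, up to an additive constant (assumed form S_i = C_i ln T_i).
        S = C1 * math.log(T1) + C2 * math.log(T2)
        # Assumed flow law: d<E_1>/dt = kappa * (T2 - T1).
        flow = kappa * (T2 - T1)
        history.append((T1, T2, S, flow))
        # Energy leaving system 2 enters system 1, so total energy is conserved.
        T1 += flow * dt / C1
        T2 -= flow * dt / C2
    return history

history = simulate(400.0, 200.0)  # system 1 starts hotter
entropies = [h[2] for h in history]
entropy_never_decreases = all(b >= a for a, b in zip(entropies, entropies[1:]))
energy_leaves_hot_system = all(h[3] < 0 for h in history)
print(entropy_never_decreases, energy_leaves_hot_system)
```

With this flow law the quantity $(1/T_1 - 1/T_2) \, d\langle E_1 \rangle / dt$ equals $\kappa (T_2 - T_1)^2 / (T_1 T_2)$, which is nonnegative whenever both temperatures are positive, in line with the inequality above.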


Puzzle 55. Suppose one or both of the temperatures $T_1$, $T_2$ are negative. How does this
conclusion change?


THE THIRD LAW OF THERMODYNAMICS: REVISITED


If a system has countably many states, 
with just one of lowest energy, 


and thermal equilibrium is possible for this system 
for some temperature T > 0, 
then its entropy in thermal equilibrium approaches zero 
exponentially fast as a function of $1/T$
as $T$ approaches zero from above.


In our earlier work on the Third Law, we only studied systems with finitely many 
states. Later we saw how to compute entropy from the free energy and expected energy. 
This makes it a bit easier to handle systems with a countable infinity of states. In the 
following puzzles, which are only for the most devoted readers, let's use these ideas to 
prove and improve the Third Law for systems with countably many states. 

Earlier we worked with temperature, but it's cooler to use coolness. For all the 
following puzzles, let's suppose we have a system with a countable infinity of states 
$n = 1, 2, 3, \dots$ with energies $E_n$. Also suppose thermal equilibrium is possible for some
$\beta_0 > 0$, i.e., the sum

$$ Z(\beta_0) = \sum_{n=1}^{\infty} \exp(-\beta_0 E_n) $$


converges. (Our arguments also apply to systems with finitely many states, where this 
convergence condition is automatic.) 


Puzzle 56. Show that the system's partition function, expected energy, free energy and 
entropy are well-defined for all $\beta > \beta_0$.


Puzzle 57. Show that if we add some constant to the energy of each state

$$ \tilde{E}_n = E_n + c $$

we get a new 'shifted' system whose partition function, expected energy, free energy
and entropy are related to those of our original system by

$$ \tilde{Z} = \exp(-\beta c) \, Z, \qquad \langle \tilde{E} \rangle = \langle E \rangle + c, \qquad \tilde{F} = F + c, \qquad \tilde{S} = S $$

for all $\beta > \beta_0$.
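
Before proving these relations it can be reassuring to see them hold numerically. The following Python sketch does this for a finite system, which the text says is also covered. The energies, the shift `c` and the coolness `beta` are arbitrary illustrative values, Boltzmann's constant is set to 1, and the formulas $F = -\frac{1}{\beta} \ln Z$ and $S = \beta(\langle E \rangle - F)$ are the ones used elsewhere in this section.

```python
import math

def thermo(energies, beta):
    """Return (Z, <E>, F, S) for a finite list of energies, with k = 1."""
    weights = [math.exp(-beta * e) for e in energies]
    Z = sum(weights)
    mean_E = sum(e * w for e, w in zip(energies, weights)) / Z
    F = -math.log(Z) / beta
    S = beta * (mean_E - F)
    return Z, mean_E, F, S

# Hypothetical example values.
energies = [0.3, 1.0, 1.7, 4.2]
c = 2.5
beta = 1.3

Z, mean_E, F, S = thermo(energies, beta)
Z_s, mean_E_s, F_s, S_s = thermo([e + c for e in energies], beta)

print(Z_s, math.exp(-beta * c) * Z)   # partition functions differ by exp(-beta*c)
print(mean_E_s, mean_E + c)           # expected energy shifts by c
print(F_s, F + c)                     # free energy shifts by c
print(S_s, S)                         # entropy is unchanged
```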


Now further suppose that our original system has just one state of least energy. 
Earlier we saw that we could reindex the states so that $E_1 < E_2 \le E_3 \le \cdots$ and
$E_n \to +\infty$. The same is true of our new shifted system, and let's choose $c = -E_1$ so
that the lowest energy of the shifted system is zero. With this shift we have

$$ 0 = \tilde{E}_1 < \tilde{E}_2 \le \tilde{E}_3 \le \cdots $$

and $\tilde{E}_n \to +\infty$.
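

Puzzle 58. Show that for any coolness $\beta > \beta_0$ we have

$$ \tilde{Z}(\beta) - 1 = \sum_{n=2}^{\infty} e^{-\beta \tilde{E}_n}. $$

Using this equation show

$$ |\tilde{Z}(\beta) - 1| \le e^{-(\beta - \beta_0) \tilde{E}_2} \, |\tilde{Z}(\beta_0) - 1| $$

and thus

$$ |\tilde{Z}(\beta) - 1| \le \mathrm{const} \; e^{-(\beta - \beta_0) \tilde{E}_2} $$

for some constant independent of $\beta$. Use the fact that $\tilde{F}(\beta) = -\frac{1}{\beta} \ln \tilde{Z}(\beta)$ to show that
for large enough $\beta$,

$$ |\tilde{F}(\beta)| \le \mathrm{const} \; e^{-(\beta - \beta_0) \tilde{E}_2} $$

possibly for a different constant independent of $\beta$. Using Puzzle 57, conclude that

$$ |F(\beta) - E_1| \le \mathrm{const} \; e^{-(\beta - \beta_0)(E_2 - E_1)}. $$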


Voilà! This shows that for a system with countably many states and just one state 
of lowest energy, if thermal equilibrium is possible at some positive temperature, then 
the free energy must approach this lowest energy exponentially fast as $\beta \to +\infty$. Now
let's show something similar for the expected energy. Again we use the shifted system 
to simplify the calculations. I'll leave more work to you this time. 


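Puzzle 59. Show that at any coolness $\beta > \beta_0$ we have

$$ \frac{d}{d\beta} \tilde{Z}(\beta) = -\sum_{n=2}^{\infty} \tilde{E}_n \, e^{-\beta \tilde{E}_n}. $$

Use this to show that $\frac{d}{d\beta} \tilde{Z}(\beta)$ goes to zero exponentially fast as $\beta \to +\infty$. Using

$$ \langle \tilde{E} \rangle(\beta) = -\frac{d}{d\beta} \ln \tilde{Z}(\beta) = -\frac{1}{\tilde{Z}(\beta)} \frac{d}{d\beta} \tilde{Z}(\beta) $$

and Puzzle 58, show that $\langle \tilde{E} \rangle(\beta)$ goes to zero exponentially fast as $\beta \to +\infty$. Using
Puzzle 57, conclude that $\langle E \rangle(\beta)$ approaches $E_1$ exponentially fast as $\beta \to +\infty$. Finally,
since

$$ S = k\beta(\langle E \rangle - F) $$

and both $F$ and $\langle E \rangle$ approach $E_1$ exponentially fast as $\beta \to +\infty$, conclude that $S$
approaches 0 exponentially fast as $\beta \to +\infty$.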


Let's summarize! Suppose we have a system with a countable infinity of states and 
just one state of lowest energy. If thermal equilibrium is possible for this system for 
some $T > 0$, the Third Law of Thermodynamics says its entropy in thermal equilibrium
goes to zero as $T$ approaches zero from above. But in fact we can say more: for some
$a, b > 0$ we have

$$ |S(\beta)| \le a \, e^{-b\beta} $$

for all large enough $\beta$.
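
To watch this exponential decay happen, here is a Python sketch for one example system. Everything specific in it is an assumption chosen for illustration: the energies are $E_n = n$, the infinite sum is truncated at 200 states, Boltzmann's constant is set to 1, and the constants $a = 2$, $b = 1/2$ are simply values that happen to work for this example when $\beta \ge 1$. The code shifts the lowest energy to zero as in Puzzle 57, and computes $\ln \tilde{Z}$ from the sum over $n \ge 2$ in Puzzle 58 so that tiny values are not lost to rounding.

```python
import math

def entropy(energies, beta):
    """Entropy S = beta * (<E> - F) with k = 1, assuming a unique lowest energy."""
    E1 = min(energies)
    shifted = [e - E1 for e in energies]          # shifted system: lowest energy 0
    tail = sum(math.exp(-beta * e) for e in shifted if e > 0)   # Z~ - 1
    Z = 1.0 + tail
    mean_E = sum(e * math.exp(-beta * e) for e in shifted) / Z  # <E~>
    F = -math.log1p(tail) / beta                  # F~ = -(1/beta) ln Z~
    return beta * (mean_E - F)                    # unchanged by the shift

# Hypothetical system: energies 1, 2, 3, ..., truncated at 200 states.
energies = [float(n) for n in range(1, 201)]
a, b = 2.0, 0.5   # illustrative constants for this example only

for beta in (1.0, 5.0, 10.0, 20.0, 30.0):
    S = entropy(energies, beta)
    print(beta, S, a * math.exp(-b * beta))
```

For this particular system the shifted partition function is a geometric series, so with $q = e^{-\beta}$ the entropy is $\beta q/(1-q) - \ln(1-q)$, which behaves like $(\beta + 1) e^{-\beta}$ for large $\beta$.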